MEASUREMENT OF WRITING ABILITY AT
THE COLLEGE-ENTRANCE LEVEL:
OBJECTIVE VS. SUBJECTIVE
TESTING TECHNIQUES*


EDITH M. HUDDLESTON ** 
Educational Testing Service 
New York City 


A. Background of the Problem 





ONE OF THE most fundamental controver- 
sies in the history of psychometrics has been 
the debate on the relative merits of objective vs. 
subjective measuring techniques. The subjec- 
tive, free-answer tests have always had the ad- 
vantage of apparent or ‘‘face’’ validity: the tra- 
ditional essay examinations are of the work-
sample type, and require the examinee independ-
ently to summon and organize his relevant knowl- 
edge. Thus the essay test has been thought of 
as a ‘‘natural’’ task, allowing a direct approach 
to important goals. But as far back as the 1880’s 
it was realized that this otherwise ideal testing 
device was beset with the pitfall of unreliability 
(17). Early studies (14,54) showed that there 
was considerable discrepancy among teachers’ 
marks, and a vast array of later studies confirm 
this fact. In 1929 Ruch provided a table sum- 
marizing 285 coefficients of reliability for essay 
examinations, with a median reliability of .59 
(44, p. 107). This corroborates the experience 
of later investigators. In 1947 Adkins (1, p. 6) 
wrote: ‘‘Essay tests, no matter what their mer- 
its may be, are commonly considered imprac-
tical if the number of subjects is at all sizable,
because of the great difficulty in scoring them 
reliably and because of the time required to 
score them.’’ A review of the literature indi- 
cates that the unreliability of essay examinations 
is most pronounced in the area of English com- 
position, while efforts to improve reliability
for achievement examinations covering specific
content (e.g., history or physics) have been
somewhat more successful. Stalnaker (34), who
proposed an ‘‘analytical’’ method of scoring es- 
say tests as a means of raising reliability, re- 
ported that for English composition the method 
is unsatisfactory. 

The objective test has its origins in this dem-
onstrated need for reliability. With the objec-
tive test the problem of ‘‘reader reliability’’
(agreement of readers with one another) disap-
pears, and the scoring of the test is simply a
matter of clerical accuracy. There still re- 
mains an error of measurement due to imper- 
fect test reliability but it has been demonstrat- 
ed that the objective test is generally much more 
reliable. A good objective test has a reliability 
of at least .85, and frequently of .90 or more. 
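As an illustration of what a ‘‘reader reliability’’ coefficient expresses, the following minimal sketch computes the product-moment correlation between two independent readings of the same papers. The marks are hypothetical, and the function and variable names are illustrative assumptions; the studies cited here may have computed their coefficients in other ways.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical marks given by two readers to the same eight essays.
reader_a = [72, 65, 88, 54, 91, 60, 77, 69]
reader_b = [70, 58, 80, 61, 85, 66, 79, 62]

# Agreement of the two readers with one another.
reader_reliability = pearson_r(reader_a, reader_b)
print(round(reader_reliability, 2))
```

A coefficient near 1 means the two readers rank the papers almost identically. It does not show how consistently the examinees themselves would perform from one essay to another.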

Historically, however, the objective test has 
been severely criticized on the grounds that it 
presents the examinee with a task which is art-
ificially oversimplified. It has been charged
that the examinee is inadequately measured when 
he is required merely to choose his answer 
from among a number of answer-choices which 
are set down for him. The proponents of objec- 
tive tests have been challenged to show that such 
measuring devices can be inherently valid in 
spite of the apparent predigestion of much of the 
subject-matter to be tested. 

To a large extent this challenge has been sat-
isfactorily met. Objective-type questions have
been developed which do require thought and or-
ganization on the part of the examinee; statisti-
cal analysis has indicated that abilities tested by
objective tests are frequently similar to those
which essay examinations aim to measure; in
many cases objective tests have been found su-
perior because they can sample a wider area of
subject-matter in a given length of time; and in-
evitably the objective tests can be more reliably
scored.

The testing of ability in English composition, 
however, is one area in which the dilemma has 
not been so easily resolved. The difficulties 
have been summarized in an article by Greene 
(24) in which he emphasized the need for improve- 
ment in the validity and reliability of composition 
scales of the essay type, but expressed unwill- 
ingness to accept objective measurement as an 
alternative because most objective tests emphas- 
ize mechanical skills at the expense of style and 
quality, and because pupils may ‘‘respond cor-
rectly in objective tests to items which they do
not use correctly in their own expression.’’

Greene's view is typical of many others. Yet 
there has thus far appeared no evidence to indi- 
cate that adequate essay tests can be devised. 
While reader reliabilities of .85 or higher have 
occasionally been reported (2, 12, 39, 48, 50, 57), 
the preponderance of the evidence indicates that 
this is exceptional. The successes reported by 
Noyes (39) and by Stalnaker (48,50) are to be 
weighed against their contrary experience (40, 
41,51,52) and considered in the light of their con- 
clusions, after years of working with the prob- 
lem, that no satisfactory solution had been at- 
tained (34,40,52). Likewise, the reader relia- 
bilities of .85 and .94 reported by Traxler and
Anderson (57) are mitigated by the facts that only
two readers were involved and that the intercor-
relation between the two comparable essay tests
was only .60. Traxler and Anderson explained
that ‘‘reliability of pupil performance is relative-
ly low,’’ but the hypothesis can be entertained
that the readers may have reached their highly 
consistent scores by ignoring certain aspects 
of pupil performance which should have been eval- 
uated. 

Probably there is no fund of experience in the 
field which is more extensive than that of the Col- 
lege Entrance Examination Board, which for 
years has put its best resources into the devel- 
opment of its English Composition test; yet in 
the Board’s annual report for 1946 (10, p. 7) the 
executive secretary stated that ‘‘the problems 
involved in developing a reliable essay examin- 
ation are, if not unsolvable, at least far from 
solved at the present time.’’ For a long paper
on a single topic, the Board readers achieved a
reader reliability of .55 which is, as Stalnaker
said, ‘‘about the same as the relationship be-
tween height and weight.’’ (52) Six other themes
(40) were read with reliabilities, respectively,
of .67, .66, .83, .69, .58, and .59. Further
unpublished data in the Board’s files show that
for nine one-hour examinations consisting of
three or four short essay questions each, the
reader reliabilities for total score ranged from
.68 to .89. Note that the figures below represent
‘‘reader’’ reliabilities only; when the (unknown)
test unreliability is also taken into account, the
total unreliability is even greater. It is interest-
ing to observe that the correlations with the Ver-
bal score of the Scholastic Aptitude Test were
almost as high as the reader reliabilities.


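                    Reader         Correl’n with
Test                Reliability    S.A.T. Verbal

September 1944      [illegible]    [illegible]
December 1944       [illegible]    .67
April 1945          [illegible]    .55
June 1945           [illegible]    [illegible]
September 1945      [illegible]    [illegible]
December 1945       [illegible]    .61
April 1946          [illegible]    .58
June 1946           [illegible]    ._5 [first digit illegible]
December 1946       [illegible]    .57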


Intercorrelations among the Board’s essay 
questions tended to be low. In view of the fact 
that the questions were designed to measure com- 
parable abilities, the low intercorrelations lead 
one to suspect that the true test reliability was 
considerably lower than the reader reliability. 
The data from the Board's files (see page 168) 
refer to the essay tests of September and Decem- 
ber 1944 when the reader reliabilities were com- 
paratively high. (The reader reliabilities are 
shown in the diagonals.)

The available evidence indicates that the ex-
perience of others corroborates that of the
Board. Starch and Elliott (54) had 142 high
school English teachers each mark two compo- 
sitions, and found variation as high as 35 to 40 
percentage points between two independent read- 
ings of the same theme. Hulten (30) found that 
28 high school English teachers who graded an 
English composition twice at an interval of two 
months were quite inconsistent; 15 teachers 
showed a shift in judgment which would have 
changed the paper from a passing grade on the 
first test to a failing grade on the second. A 
number of other writers have reached similar 
conclusions (21, 25, 28, 42, 63), while some 
have held out hope for improving reader reliabil- 
ity (29, 45, 53, 58, 60, 61). However, there 
is no convincing evidence in the literature that 
these hopes have come to fruition. 

In contrast to the discouraging status of the
essay test in English composition, there have
been increasing indications of the effectiveness
of the objective-type test. Tests of English gram-
mar and rhetoric and tests of verbal ability have
shown high reliabilities and high validities, and
there is evidence to indicate their superiority.

Reported reliabilities for objective English 
tests include representative examples (see page 
168). The effectiveness of the objective English 
tests in predicting appropriate criteria is sum- 
marized in the list of validity studies (page 169). 
Averill (4) and McGann (37) have reported their 
success in using objective English tests for guid- 
ance purposes. Studies by Willing (62,63) led 
him to conclude that tests in the recognition of 
errors are reasonably good instruments for pre- 
dicting the average number of formal errors that 
pupils will make in free composition, but not for 
predicting the specific kinds of errors. 

Pressey (43) conducted a study in which he 
compared the validity of essay tests with a test 
of his own construction. His test included ob-
jective items and items in which the student was
required to revise given material. He secured 
ratings in written English from three experienced 
teachers, the average of whose ratings was used 
to represent the general ability of the children 
in written English. With these ratings as criter- 
ia, he determined the following validity coeffic- 
ients: scores on narrative composition, .29; 
scores on descriptive composition, .33; scores 
on his own test, .72. A study by Hartson (27) 
using the same types of tests corroborated Pres- 
sey’s findings. McKee (38) and Stalnaker (46) 
offer additional evidence that objective English 
tests are more valid than are scores on themes, 
while a study by Flemming (18) indicated that a 
revision test showed a higher relationship to Eng- 
lish grades than did a composition rated by the 
Hudelson scale. Willing’s work (62) showed that 
a revision test was superior to free composition 
in that it disclosed more student weaknesses. 

With regard to the charge that an objective 
test gives a distorted picture in that it tests rec- 
ognition rather than recall, a study by McCull- 
ough and Flanagan (36) gives contradictory evi- 
dence. These investigators found a correlation 
of .88 between the usage section of the all-objec-
tive Cooperative English Test, Form OM, and 
the usage section of the Cooperative English Test 
1937, which was largely a correction-of-error 
test, permitting free response written in the test 
booklet. Using the Wisconsin tests of grammat- 
ical correctness, Leonard (33) reported a cor- 
relation of .68 between the objective and the free-
response form. And Stalnaker (47) concluded 
that for at least one group of students a test of 
ability to classify sentence faults gave virtually 
the same score as did a test of ability to correct 
them. 

The evidence with respect to the characteris-
tics of verbal-factor tests leads to the hypoth-
esis that the verbal factor may be of power 
value in predicting success in English compos-
ition. Verbal tests are typically highly reliable;
for example, the verbal section of the C.E.E.B.
Scholastic Aptitude Test was reported to have a 
corrected odd-even reliability coefficient of .96 
(10, pp. 30-33). A factorial study by Carroll 
(9) as well as the Board’s own developmental 
work indicates that the verbal score of the Scho- 
lastic Aptitude Test measures primarily verbal 
ability. Correlations between verbal tests and 
English grades have been reported by several 
investigators (see page 170). 
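A ‘‘corrected odd-even’’ coefficient of the kind mentioned above is ordinarily obtained by correlating scores on the odd-numbered items with scores on the even-numbered items and then stepping the result up to full test length with the Spearman-Brown formula, r_full = 2r / (1 + r). The sketch below illustrates that procedure on hypothetical right/wrong item data. The function names and the data are assumptions for illustration, and the Board's exact computing routine is not described in the text.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_brown(r, k=2.0):
    """Estimated reliability of a test k times as long as the one yielding r."""
    return k * r / (1.0 + (k - 1.0) * r)


def odd_even_reliability(item_scores):
    """Correlate odd-item and even-item half scores, then correct to full length."""
    odd = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    return spearman_brown(pearson_r(odd, even))


# Hypothetical data: six examinees, eight items scored 1 (right) or 0 (wrong).
scores = [
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 1, 0, 0],
    [1, 1, 0, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
]
print(round(odd_even_reliability(scores), 2))
```

The correction is needed because each half is only half as long as the full test, so the raw half-test correlation understates the reliability of the whole.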

There is some evidence regarding the inter- 
relationships among objective English tests and 
verbal tests. Crawford and Burnham (11) re- 
ported correlations between verbal and objective 
English tests as ranging between .65 and .83.
Doppelt (13) obtained a correlation of .72 between 
a verbal reasoning test and a grammar test. 
Krathwohl (31) found that a vocabulary test cor- 
related .58 with each of two objective tests in 
English expression. McCullough and Flanagan 
(36) reported a correlation between vocabulary 
and usage of .69. These correlations are high 
enough to suggest the hypothesis that objective 
English tests are heavily loaded with the verbal 
factor. 

While the present review of the literature 
gives some important insights into effective 
measurement of writing ability, it points to the 
necessity for a large-scale investigation in which 
all variables are pooled to determine their inter- 
relationships with one another. Particularly in- 
determinate is the question of whether objective 
English tests or verbal-factor tests are more 
closely related to ability in English composition. 


B. Purpose of the Investigation 





As the preceding review of literature indi- 
cates, the problem of measuring ability in Eng- 
lish composition has been attacked piecemeal 
with the consequence that little light has been 
thrown on its over-all aspects. Tests with vary- 
ing degrees of promise have been evaluated more 
or less effectively, but on separate populations. 
This situation is understandable in view of the 
great expense which would ordinarily be involved 
in a large study combining a number of variables 
and employing a sizeable test population. The 
present writer is fortunate in having had the op- 
portunity and resources necessary to carry out 
such an investigation. 

The studies reported in the present investi- 
gation were designed to meet the needs indicated 
by the gaps in present knowledge, and were con- 
ducted at the request of the College Entrance Ex- 
amination Board. The initial problem was to
work with teachers in developing an acceptable

Intercorrelations Among Questions
(essay tests of September and December 1944;
reader reliabilities in the diagonals)

[Table entries for Questions 1 to 4 are not legible.]

Reported Reliabilities for Objective English Tests

Source of Information           Test

Asher (3)                       Kentucky English Test
Buros (6, No. 1269.1)           College English Test: National Achievement Test
California Test Bureau (8)      Test of English Usage
Lindquist (35)                  Test of Correctness and Appropriateness of Expression
Stalnaker (49)                  University of Chicago English scholarship
                                examination of May 1934
Traxler (56)                    Cooperative English Test A, Mechanics of
                                Expression (Form R)
World Book Company (64)         Barrett-Ryan-Schrammel English Test

Reliability (entries in the order printed; their alignment
with the rows above is not recoverable):

.93 (retest)
.88 (Spearman-Brown)
.94, .95 (Kuder-Richardson)
.92, .94 (Spearman-Brown)
.88 (Spearman-Brown)
.93 (Spearman-Brown)
.88, .89 (alternate form)
.94, .91, .91 (Spearman-Brown)

List of Validity Studies

Investigator              Test                                    Validity        Criterion
                                                                  Coefficient

Asher (3)                 Kentucky English Test                   .13, .62        Freshman English grades

Berg, Johnson and         Cooperative English Test A,             [illegible]     Freshman English grades
Larsen (5)                Mechanics of Expression (Form Q)

Cade (7)                  Conkling and Pressey Diagnostic
                          Tests in English Composition
                            (Form C)                              [illegible]     Freshman English grades
                            (Form D)                              [illegible]     Freshman English grades

Edmiston and              English Usage Test of the Ohio          [illegible]     Scores on a composition test
Gingerich (15)            State Every Pupil Tests

Fletcher and              Ohio State University English
Hildreth (19)             Placement Test:
                            Usage Section                         .28             Freshman English grades
                                                                  .20 to .29      Instructors’ ratings
                            Grammar Section                       .23             Freshman English grades
                                                                  -.11 to .22     Instructors’ ratings

Glatfelter (22)           Cooperative English Test:               .66             Freshman English grades
                          Usage score

Hartson (27)              Pressey Diagnostic Tests in             .63, .43        Freshman English grades
                          English Composition

McCullough and            Cooperative English Test,               .54, .62        Teachers’ ratings (12th grade)
Flanagan (36)             Form OM: Usage score

Wagner and Strabel        Cooperative English Test,               [illegible]     Freshman English grades
(59)                      Subtest Mechanics

Correlations Between Verbal Tests and English Grades

Investigator                  Test                                     Correlation with
                                                                       English Grades

Carroll (9)                   C.E.E.B. S.A.T. Verbal                   .39

Crawford and Burnham (11)     C.E.E.B. S.A.T. Verbal                   .37 to .64
                              Yale Verbal Reasoning Test               .33 to .54

Ellison and Edgerton (16)     Thurstone’s Verbal Factor                [illegible]

Garrett (20)                  Thorndike’s CAVD Test:
                                Completion score                       [illegible]
                                Vocabulary score                       [illegible]
                                Directions score                       [illegible]

Goodman (23)                  Thurstone’s Verbal Factor                .40 to .55

Grinnell (26)                 Inglis Tests of English Vocabulary       .53

Hartson (27)                  Inglis Tests of English Vocabulary       .60

Landry (32)                   C.E.E.B. S.A.T. Verbal                   .51

Thompson and Haines (55)      A.C.E. Psychological Examination:        .47, .63
                              Linguistic score

Wagner and Strabel (59)       A.C.E. Psychological Examination:
                                Opposites                              [illegible]
                                Completion                             [illegible]

criterion of ‘‘ability to write.’’ It must be rec-
ognized that in the present state of our knowl-
edge and for a long time to come, this ability 
cannot be defined in great detail; it is a compre- 
hensive ability, and one which can be expected 
to vary in the same person from time to time 
and from task to task (34, p. 504). Hence, the 
writer decided that ratings of writing ability 
would be the best type of criterion which could 
be developed, particularly since the corps of 
College Entrance Examination Board readers 
provided an excellent source of raters who had 
had long experience in the evaluation of writing 
and who also had classes of students with whose 
abilities they were well acquainted. By working 
with these raters individually and intensively the 
writer expected to secure a conscientious and 
well-considered rating in each case. 

The construction of the test material to be 
evaluated and analyzed was planned as follows: 
(1) The writer felt that much of the criticism of 
available objective tests in English was valid as 
the questions on such tests are frequently picay- 
une. The writer proposed, therefore, to devel- 
op some objective questions to test the abilities 
normally covered by such tests but to emphas- 
ize the type of question which would give the 
greatest possible opportunity for the student to 
restructure the written material presented and 
which would come as close as possible to meas- 
uring those attributes which essay tests aim to 
measure. An aim of the study was to test the 
hypothesis that such a test would be a better 
measure of writing ability than an essay test. 
(2) Since the highly structured revision-type
material had shown promise, it was decided to 
include it here. Such test material presents the 
examinee with a definite and specific task, allow- 
ing the readers very little latitude for difference 
of opinion. It was desired to determine if this 
semi-objective essay-type question, which would 
probably be highly reliable, would also be valid. 
(3) The writer wished to discover the extent to 
which a purely ‘‘verbal’’ test would measure a- 
bility to write. The hypothesis was made that 
all of our tests of writing ability, both objective
and subjective, are measuring verbal ability 
and are measuring it less well than the tradition- 
al verbal test measures it. The hypothesis to 
be tested was that writing ability, insofar as it 
could be measured at all, is verbal ability.
(4) Relationships among all variables were to be
studied. 


C. The General Study Plan 





The basic problem for investigation grew out 
of a concrete situation in which the abilities of
students were being evaluated for the guidance
of college admissions and placement officers. 
Fortunately, this concrete situation was one 
which lent itself readily to providing large ex- 
perimental populations and generous coopera- 
tion from professional people in the schools and 
colleges and on the College Entrance Examina- 
tion Board staff.1 The study, therefore, was 
designed in such a way as to take full advantage 
of the Board’s facilities and of its established 
testing programs and procedures. 

The Board’s largest testing program takes 
place in the spring of the year when secondary 
school seniors are examined for admission to 
college in the following September. The Eng-
lish essay examinations are read and scored by
a group of secondary school teachers and col- 
lege instructors who come to the Princeton lab- 
oratory for this purpose. The readers work 
under the general direction of the Committee of 
Examiners in English Composition. At any one 
session most of the readers are people with 
years of previous experience, although new 
readers join the group from time to time. The 
work is highly organized. The readers work in
groups of about half a dozen, each group having 
a table leader whose responsibility it is to spot- 
check the work of the others and to confer with 
other table leaders and with the Chief Reader 
to insure the maximum amount of consistency 
in evaluation. Ordinarily the first day is spent 
in reading and discussing ‘‘sample’’ papers in 
order to arrive at mutual agreement on the ap- 
plication of standards. Prior to this time the 
table leaders have already spent considerable 
time in similar practice. 

The study plan aimed to utilize the April 
program to provide an opportunity for compar- 
ison of different types of test material on a large 
population. Before this could be done, however, 
some preliminary evidence was needed to justi- 
fy the presentation for the first time of objec- 
tive test items in a Board examination in English 
Composition. Hence, a preliminary study was 
planned to obtain some validity data on the pro- 
posed item types and to test the effectiveness 
of the procedures proposed for use in April. 

As a method of attack on the problem of 
measuring writing ability, the two parts of the 
investigation are to be regarded as mutually 
complementary. Not every aspect of the prob- 
lem is dealt with in each of the studies, but the 
general conclusions are derived from a synthesis 
of the findings of both. 

The function of Study I was to construct the
experimental test material, to develop a criter-
ion measure of ability to write, to study the
characteristics and interrelationships of the
various types of material, and to judge whether
the obtained evidence justified the continuation
of the investigation on a larger scale. While
there were fewer subjects available for Study I,
which utilized college freshmen, the possible
testing time was longer, 150 minutes as com-
pared with 60 minutes for the English test in
Study II. Thus, in Study I it was possible to use
enough material to gain information on several
breakdowns of the objective items. Study II em-
ployed the methods worked out in Study I and for
the secondary-school senior group provided a
more definitive answer to the broader problems
under investigation.

1. The former laboratory staff of the College Entrance Examination Board is now a part of the
Educational Testing Service.


D. Study I 
(a) The Test Population 


Sixteen college-freshman English classes
participated in the study, ranging in size from
22 to 33 students per class. The students in
these classes were tested during October and
November, shortly after the beginning of their
first semester of college study in English. Five
separate educational institutions were represent-
ed. The classes may be described as shown be-
low.


Institution A. Large middle-western college,
coeducational:

     Class 1     Average students           N = 25
     Class 2     Above-average students     N = 27
     Class 3     Below-average students     N = 33
     Class 4     Average students           N = 24

Institution B. Large middle-western university,
coeducational:

     Class 5     Above-average students     N = 25
     Class 6     Average students           N = 22
     Class 7     Average students           N = 24
     Class 8     Below-average students     N = 26

Institution C. Large eastern university, coed-
ucational:

     Class 9     Above-average students     N = 24
     Class 10    Above-average students     N = 24

Institution D. Large eastern university, male
undergraduate college:

     Class 11    Average students           N = 29
     Class 12    Average students           N = 27
     Class 13    Average students           N = 30

Institution E. Small eastern college for women:

     Class 14    Average students           N = 23
     Class 15    Average students           N = 24
     Class 16    Average students           N = 26

The indicated classifications of students as 
to ability were made by the institutions them- 
selves and thus can be expected to have a differ- 
ent meaning in each institution. In the sixteen 
classes nine instructors were represented. 
Groups of classes taught by the same instructor
were: classes 1 and 4; classes 5 and 6; classes 
7 and 8; classes 11, 12, and 13; and classes 14, 
15, and 16. 


(b) Description of Variables 


1. Objective English Test: This test was ad- 
ministered in two sections: Section I, 50 min-
utes, 149 scorable units; Section II, 10 minutes,
32 scorable units. Since several scorable units 
may often be found in a single sentence, this 
material moves quite rapidly in the testing sit- 
uation. 

The leading concept in the construction of 
this material was that the student should be giv- 
en the maximum opportunity to manipulate the 
various elements of the sentences and paragraphs 
presented to him. It had been the writer’s ex- 
perience in working with other tests that objec- 
tive items can be highly rigid or highly flexible. 
The emphasis on flexibility in the present test 
was aimed at adopting as many of the desirable 
characteristics of the free essay examination 
as was possible. Approximately forty percent 
of the questions were culled from previous tests 
constructed for special programs under the 
Board’s jurisdiction; the remaining sixty per- 
cent were tailor-made for the present investi- 
gation.2

The items were designed to fall in four gen- 
eral categories: punctuation, 27 items; idiomat- 
ic expression, 33 items; grammar, 47 items;3 
and sentence structure, 70 items. The great- 
est flexibility was attained in sentence structure, 
which involved such points as parallelism of sent- 
ence elements, illogical comparisons, misplaced 
and dangling modifiers, subordination of elements, 
wordiness, and incomplete, stringy, and run-on 
sentences. 
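The counts given above can be reconciled by simple arithmetic. The short sketch below tallies the scored items by category and compares the total with the scorable units reported for the two sections. It assumes that each item corresponds to one scorable unit, which the text does not state outright, and the variable names are illustrative only.

```python
# Scored items in each category, as stated in the text.
scored_items = {
    "punctuation": 27,
    "idiomatic expression": 33,
    "grammar": 47,
    "sentence structure": 70,
}

# Scorable units reported for the two separately timed sections.
section_units = {"first section (50 minutes)": 149, "second section (10 minutes)": 32}

# Grammar items the writer reports having omitted from scoring.
omitted_grammar_items = 4

total_scored = sum(scored_items.values())
total_units = sum(section_units.values())

print(total_scored, total_units, total_units - total_scored)  # 177 181 4
```

On this reading the 181 scorable units exceed the 177 scored items by exactly the four grammar items that were omitted from scoring.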





2. Of the new items, approximately two-thirds were constructed by the writer and approximately
one-third by Professor Scott B. Elledge.

3. There were actually 51 grammar items in the test, but 4 were omitted from scoring since it
was found that this omission would greatly reduce the clerical labor involved in combining
scores.

In format the test consists of sentences and
paragraphs in which certain portions are under-
lined and numbered. Each underlined portion
is accompanied by suggestions of several pos-
sible ways in which that portion might be writ-
ten. In each case the student is asked to decide 
which of the suggested answers is correct, or 
which sounds best in the sentence. The follow- 
ing examples will illustrate the potentialities of 
this device:4 


     The success of any experiment is jeopardized by
     inaccuracy one error may invalidate all the conclusions
     --------------
            1
     found.
     -----
       2

     1.  (1) inaccuracy one
         (2) inaccuracy, one
         (3) inaccuracy; one

     2.  (1) found
         (2) which are found
         (3) OMIT

     I wished to go and return on the same day
     for several reasons.
     -------------------
              3

     3.  (1) (Leave where it is now.)
         (2) (Place at beginning of sentence.)
         (3) (Place after ‘‘return.’’)


2. Essay Questions: The three 20-minute es-
say questions were proposed by the Committee
of Examiners in English Composition, and may 
be regarded as typical of essay questions prev- 
iously set by them. The topics were intended 
to be sufficiently universal that all students would 
have pertinent experiences upon which to draw. 
The completed papers were sent to Princeton 
where they were scored by a group of the Board’s 
regular readers according to typical Board
standards. Each essay was rated for: material
and organization; spelling; punctuation; syntax;
vocabulary; and sentence structure.

The three essay questions were worded as 
follows: 


I. In about 150 words write a well-organized 
paragraph on the subject, ‘‘A Change is Need- 
ed.’’ You may advocate any change which you 
seriously think should be made. Support your 
argument with as much detail as possible. 
II. The statement ‘‘All men are created equal’’
may be interpreted in several different ways, 
because in the statement the word equal may 
have several meanings. Explain with the 
help of examples several of the meanings of 
equal and indicate which of these meanings 
you think makes the statement most nearly 
true. Your answer should be in the form of 
a well-unified paragraph of about 150 words 
that has for its subject what you think this 
statement means. 


III. In a paragraph of about 150 words discuss
a serious error which you think parents may
make in the rearing of a child, and indicate
the possible consequences of this error. Use
your own experience or your observation of
friends and acquaintances to support your
opinion.


3. Paragraph-Revision Test: This test was 
devised in an attempt to bridge the gap between 
the objective and free essay forms. Even the
most intensive self-discipline by the readers 
leaves great latitude to the individual reader in 
scoring essays on topics such as those listed 
above. An important criticism of the objective 
test, on the other hand, is that it gives the stud- 
ent several alternatives from which to choose 
instead of allowing him to write naturally. In 
the paragraph-revision test the student has a 
complete paragraph to work with, which he is 
to rewrite as a whole. Specific imperfections 
are planted in the paragraphs, which the student 
may or may not recognize. It is up to the stud- 
ent to judge the correctness of the style and to 
demonstrate his ability to improve upon it. For 
the reader the task is eased by the fact that he
has a series of specific points upon which to 
rate the student. There is an obvious limitation 
in the variety of responses which are possible, 
and the readers may agree in advance on the 
score value of each type of response. 

The two paragraphs used, under a total time- 
limit of 20 minutes, are as follows: 


Directions: In each of the paragraphs below
you are to assume that the first and last sent-
ences are satisfactory as they stand, but that
the material between the first and last sentences
needs to be rewritten in the interest of correct-
ness and good style. In your answer booklet,
rewrite each paragraph, making whatever changes
you think desirable in order to produce a smooth,
well-written, and well-organized piece of work.
You need not keep the same sentence divisions,
or the same order of presentation, but it is im-
portant to preserve completely the original
writer’s meaning. Make no changes in the first
and last sentences.

4. For additional illustrations see Appendix I. The answers to the above questions are:
(1) choice 3; (2) choice 3; (3) choice 3.

A

One of the most remarkable things
about the Chinese is their power to se-
cure the affection of foreigners. Nearly
every single person who is a native of
Europe likes China, those who both
come only as tourists and those who
live there for many years. In spite of
the Anglo-Japanese alliance, I can’t
hardly recall one single Englishman in
the Far East who liked the Japanese as
well. The obvious evils are strikingly
obvious to whomever has just recently
arrived, such as the people are begging,
terrible poverty existing, disease being
very prevalent, and the anarchy in the
politics, which are also corrupt. The
Westerner’s strong desire to reform
these evils does not, however, affect
his love for the people.





B 

For two days and nights the havoc 
raged unchecked through all the church- 
es of Antwerp and the neighboring vil- 
lages. There weren't hardly any stat- 
ues or pictures which escaped destruc- 
tion. Rubens, fortunately, who was the 
illustrious artist whose labors were
destined to enrich, in the next genera- 
tion, and ennoble the city, most pro- 
found of colorists, and the most dramat- 
ic of any artists, had not been born yet. 
Of the treasures which existed, how- 
ever, the destruction was complete. 


4. Verbal Test: The verbal items were anto- 
nyms, chosen from the Board’s Scholastic Apt- 
itude Test to represent the range of difficulty 
covered by that test. This was a 30-item test, 
timed at 10 minutes. Since Study II was to in-
clude the complete S.A.T., it was felt that a
minimum portion of the available testing time
in Study I should be occupied by verbal items.
The plan was to depend chiefly on Study II for an
evaluation of the relationship of verbal ability to
writing ability. However, the 30 items are suf-
ficient to give a rough indication of what might
be expected in Study II.

The following items are similar to the ones 
used in the study:5 


Directions: Each question in this part consists 
of a group of four words, two of which are ap- 
proximately opposite to each other in meaning. 
Decide which two words in each group are most
nearly opposite, and blacken the space beneath
the corresponding pair of numbers on the answer
sheet; i.e., mark the space between the dotted
lines beneath ‘‘1-2’’ if words numbered 1 and 2
are opposite, beneath ‘‘3-4’’ if words 3 and 4
are opposite, etc. Mark only ONE set of dotted 
lines for each question. 


1. 1-qualified 2-unfit 3-healthful 4-primitive 

2. 1-circumscribed 2-tedious 3-senile 4-inter- 
esting 

3. 1-authentic 2-mechanical 3-spurious 4-pro-
ductive


5. Instructor’s Ratings: The major criterion 
consisted of ratings by English instructors of 
their students’ ‘‘ability to write.’’ In order to 
secure a definition of writing ability which could 
be accepted and applied by instructors who might 
vary considerably in their personal definitions, 
it was decided to limit the definition to a com- 
mon core. The instructors were asked to ‘‘con- 
sider only the students’ ability to write the kind 
of expository prose that students are called up- 
on to write in college, and to assume that each 
student is writing on a subject which is congen- 
ial to him.’’ It was also felt that this restricted 
definition was desirable in that it designates that 
aspect of writing ability which is most important 
throughout the average student’s college career. 

The basis for these ratings was discussed at 
considerable length with each instructor, in or- 
der to assure that all had the same concept of 
the definition to be applied.6 The informality of 
the discussions with the instructors was import- 
ant in gaining rapport and in maximizing the in- 
structors’ understanding; often a main point was 
volunteered by the instructor rather than being 
told to him. The following points were emphas- 
ized in each interview: 


1. Do not judge the students’ ‘‘imaginativeness’’
or ‘‘creative ability.’’

2. Insofar as possible, eliminate the intelligence
factor from the rating.

3. Do not consider factors other than writing
ability which ordinarily go into English course
grades. ‘‘Try to rate as if you were not a
teacher.’’ For example, if you are making a
particular effort to teach punctuation to a
class, you will give a low course grade to
the student who does not learn what you are
teaching; however, in making these ratings,
be careful not to let any single factor carry
too much weight. Eliminate from consider-
ation such things as behavior in class, effort,
promptness in handing in assignments, pro-
ficiency in literature, etc.

4. Do not attempt to predict the results of the
C.E.E.B. English Composition Test or of
these experimental tests. Do not give a stud-
ent a low rating because you know he will not
do his best work in an examination situation,
but base your rating on the assumption that
the student is handling a normal-length assign-
ment under ideal conditions. Base your rat-
ing on the student’s ability to do sustained
writing for an hour or two, and not upon his
ability to dash off a twenty-minute theme of
the C.E.E.B. type; the reason for this is that
the purpose of the study is to measure the ex-
tent to which the C.E.E.B. test scores indi-
cate the students’ ability in normal writing
situations.

These ratings were obtained after the instruc-
tors had had about three months of class work
with the students concerned. The instructors
had several samples of each student’s work on
previous writing assignments (not the experi-
mental tests) and they were encouraged to refer
to these liberally whenever they wished to re-
fresh their memory regarding a particular stud-
ent’s performance.

5. Since the actual items used are the confidential property of the Board, the samples offered are from the Board’s Bulletin of Information. The answers to the above questions are: (1) 1-2; (2) 2-4; (3) 1-3.

6. Professor Elledge’s work toward developing these criteria is gratefully acknowledged.

Two different sets of ratings were obtained. 
First, the instructor simply placed the students 
in rank order relative to one another. After he 
had done this, he was then asked to compare 
every student in a class with every other student 
in that class, stating each time whether the first- 
named student was better or poorer than the 
second-named student; every pair was later pre-
sented again with the two students being named
in reverse order. Each student’s paired-com- 
parison rating was taken as the total number of 
times he had been rated as better than another 
student. The two types of ratings were desired 
in order to discover the extent to which the 
paired-comparison technique would yield the 
same results as the simpler rank-order method. 
If the more elaborate method were found to in-
troduce no significant changes in the ratings,
then the tedium of that painstaking method could
be dispensed with in Study II. 
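
The paired-comparison scoring described above can be sketched in a few lines. This is only an illustration, not the procedure actually used in the study: the four student labels, the hypothetical judge `judge_better`, and its single inconsistent judgment are all assumptions made for the example.

```python
from itertools import permutations
from math import sqrt


def paired_comparison_scores(students, judge_better):
    """Count how often each student is judged better than another.

    Every ordered pair (first, second) is presented, so each pair of
    students is judged twice, once in each order of naming.
    """
    scores = {s: 0 for s in students}
    for first, second in permutations(students, 2):
        if judge_better(first, second):
            scores[first] += 1
        else:
            scores[second] += 1
    return scores


def pearson_r(xs, ys):
    """Product-moment correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical class of four students, listed best first.
rank_order = ["A", "B", "C", "D"]
position = {s: i for i, s in enumerate(rank_order)}


def judge_better(first, second):
    # Hypothetical judge: follows the rank order, except that C is
    # called better than B when C happens to be named first.
    if (first, second) == ("C", "B"):
        return True
    return position[first] < position[second]


scores = paired_comparison_scores(rank_order, judge_better)
rank_values = [len(rank_order) - position[s] for s in rank_order]
pc_values = [scores[s] for s in rank_order]
r = pearson_r(rank_values, pc_values)
print(scores, round(r, 3))
```

With a perfectly consistent judge the two sets of ratings would correlate 1.00; the single reversal in this sketch lowers the agreement slightly.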

6. English Course Grades: At the end of the 
semester, each of the instructors supplied a 
copy of the final course grades for each class 
used in the study. This secondary criterionwas 
chosen because the English Composition Testis 
intended to function as a measure of past achieve- 
ment in English courses and as a predictor of 
probable success in future English courses, 
Thus the correlations of the predictor variables 
with course grades were considered to be of in- 
terest despite the extraneous factors which en- 
ter into classroom grading. 


(c) Results 


In order to reduce clerical expenses it was 
decided to drop Section II of the Objective Eng- 
lish Test from the analyses which follow. The 
total Objective English Test score will, there- 
fore, be represented by the Section I score. The 
149 items in Section I included 27 punctuation 
items, 20 idiomatic-expression items, 47 gram- 
mar items, and 55 sentence-structure items. 
For analyzing the different types of objective 
items, however, the subscores incorporated 13 
idiomatic-expression items and 15 sentence- 
structure items from Section II. Accordingly, 
the English subscores represent: 


27 punctuation items (‘‘P’’) 

33 idiomatic-expression items (‘‘I’’) 
47 grammar items (‘‘G’’) 

70 sentence-structure items (‘‘S’’) 


An explanatory note is also in order regard- 
ing the reporting of the class N’s in the tabular 
data which follow. Not all students in each class
took all the tests, which were presented on dif- 
ferent days; additional drop-outs occurred be- 
cause some students did not receive end-of-sem- 
ester course grades. In the Paragraph-Revision 
Test, some students made no response to Par-
agraph B. Furthermore, the instructor in Class
09 was unable to allow time for Section II of the 
Objective English Test, and the instructor in 
Class 10 omitted all of the objective English 
and verbal material. The writer decided that 
it would be desirable to preserve the maximum 
number of cases for each intercorrelation, even 
though all intercorrelations for a class would not 
be based on the same N. For each intercorrel- 
ation, therefore, the corresponding N is report- 
ed. 

1. Levels of Performance: Table I gives the 
mean and standard deviation on each variable 
for each class. Inspection of these data indi- 
cates that the difficulty ranges of the various 
measures were sufficiently great to allow for 
discrimination within the classes as well as to
show large between-class differences. For each
variable all scores reported are on the same
scale, except for instructors’ ratings (the nu-
merical magnitude of which depends on the num-
ber of students in the class) and course grades
(which are comparable only for classes in the
same institution).

TABLE I

MEANS AND STANDARD DEVIATIONS OF ALL VARIABLES (BY CLASS)
(N) = Number of Cases

[Table body not legible. Surviving column headings, each with Mean, S.D., and (N): Objective English Test (Section I, ‘‘P’’ Score, ‘‘I’’ Score); Essay I; Essay II; Paragraph-Revision (Total, Paragraph A, Paragraph B); Verbal Test; Instructors’ Ratings; Course Grades.]

JOURNAL OF EXPERIMENTAL EDUCATION (Vol. XXII)

These means and standard deviations are of 
significance in interpreting the correlations with 
criteria which are given in Tables VI - X. Re- 
striction in range for particular classes will 
tend to depress the validities generally, but 
ought not to obscure the over-all relationships 
observed. It is to be expected, however, that 
the true validities for the entire group of stud- 
ents would be higher than those reported for re- 
stricted groups. (In Study II a technique has been 
devised for combining groups. ) 

These data may also be compared with the 
means and standard deviations for 294 cases 
given in Table III. Such a comparison readily 
shows the standing of a particular class in rela- 
tion to the larger group. For example, there 
are probably no students in Class 08 who reach 
the general mean for Section I of the Objective 
English Test, while most students in Class 09 
would be above the general mean. 

2. Reliability of Predictor Variables: There 
are two types of indices which are helpful in 
judging the reliability of essay material. The 
first is the ‘‘reader reliability, ’’ the extent to 
which readers agree with one another. A ran- 
dom sample of papers was selected for a re- 
reading of the essays and paragraph-revision, 
and correlations were run between the first and 
second sets of scores. These correlations are 
‘‘reader reliabilities’’ and are reported in Table 
II. The reliability for the total essay test was 
computed by summing the scores given by the 
first reader to an individual student on Essays 
I, II, and III, summing the corresponding scores
given by a second reader, and correlating the 
two sets of summation scores. (Each of the 
three essays was assigned to a separate group 
of readers, so that each summation score rep- 
resents three readers.) Papers which were to 
be re-read had the initial score sheets removed 
and were distributed to the readers in the usual 
way so that no reader knew when he was ‘‘re- 
reading. ’’ In Table II the reader reliabilities 
are seen to be low, and this finding must be 
viewed in light of the fact that the true test re- 
liabilities must necessarily be lower. (The 
reader reliability may be thought of as a meas- 
ure of scoring accuracy, which for objective 
tests is presumably close to perfect. ) 

The other way in which essay reliability may 
be judged is to evaluate the intercorrelations of 
several essay questions. This is reasonable
when the essay questions are intended to meas-
ure the same abilities, as was true in the pres-
ent study. In Table III, the intercorrelations
of the three essay questions are found to be .41,
.41, and .32. If the Spearman-Brown prophecy
formula is applied, and the assumption is made
that an essay test has three 20-minute questions,
each with a correlation of .41 with each of the
other two, then the estimated reliability of the 
total test becomes .68. Such an estimate is the 
fairest available indicator of the true reliability 
of the essays used in this study. 
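
The Spearman-Brown estimate quoted above can be checked with a short calculation. The sketch below uses the standard form of the prophecy formula; the function name `spearman_brown` is chosen here for illustration only. The same function is applied to the ‘‘P’’ and ‘‘I’’ reliabilities (.57 and .36) that the text later treats as doubled in length.

```python
def spearman_brown(r, k):
    """Estimated reliability of a test lengthened k times.

    r is the reliability (or intercorrelation) of one unit of the
    test and k is the factor by which the test is lengthened.
    """
    return k * r / (1 + (k - 1) * r)


# Three 20-minute essay questions, each correlating .41 with the others.
essay_total = spearman_brown(0.41, 3)

# ‘‘P’’ and ‘‘I’’ subscores doubled in length (reliabilities .57 and .36).
p_doubled = spearman_brown(0.57, 2)
i_doubled = spearman_brown(0.36, 2)

print(round(essay_total, 2), round(p_doubled, 2), round(i_doubled, 2))
```

The rounded results agree with the figures of .68, .73, and .53 reported in the text.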

The Paragraph-Revision data indicate that 
high reader reliability may be attained with this 
type of exercise, although the intercorrelation 
of Paragraphs A and B (.33) is within the range
of the intercorrelations found among the essays,
and thus does not promise a higher test reliabil-
ity.

Table II shows the estimated reliabilities of
the objective English and verbal material. The
Kuder-Richardson formula (21) which was adopt-
ed here typically underestimates the reliability
which would be obtained if formula (20), involv-
ing individual item-difficulties, were used. Ac-
cording to Tucker,7 the results for formula (21)
may be as much as ten percent less than those
for formula (20). For the purposes of Study I
as a preliminary investigation this briefer meth-
od was regarded as adequate for checking on the
reliability of the objective English and verbal
material.
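
As a rough illustration of Kuder-Richardson formula (21), the sketch below computes the estimate from the number of items, the mean, and the variance of total scores, which is all the formula requires. The function name `kr21` and the example figures (30 items, mean 18, S.D. 6) are hypothetical and are not taken from the study.

```python
def kr21(n_items, mean, variance):
    """Kuder-Richardson formula (21) reliability estimate.

    Uses only the number of items, the mean total score, and the
    variance of total scores; item difficulties are treated as equal.
    """
    if n_items < 2:
        raise ValueError("formula (21) needs at least two items")
    if variance <= 0:
        raise ValueError("variance of total scores must be positive")
    item_term = mean * (n_items - mean) / (n_items * variance)
    return (n_items / (n_items - 1)) * (1 - item_term)


# Hypothetical 30-item test with mean 18 and standard deviation 6.
example = kr21(30, 18.0, 6.0 ** 2)
print(round(example, 3))
```

Because the formula ignores differences in item difficulty, it tends to give a lower figure than formula (20) would give for the same data, which is the underestimation the text refers to.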

The 50-minute objective English section
shows a conservatively estimated reliability of
.93 as compared with the possibly generous es-
timate of .68 for 60 minutes of essay questions.
Of the four categories of objective English ques-
tions, the grammar and sentence-structure it-
ems show the highest reliabilities. However, the
‘‘P’’ and ‘‘I’’ groups are much shorter than the
‘‘G’’ and ‘‘S.’’ If the ‘‘P’’ and ‘‘I’’ were doubled
in length, their lengths would be greater than
the length of ‘‘G’’ and less than the length of
‘‘S.’’ The Spearman-Brown prophecy formula,
when applied on the assumption that ‘‘P’’ and
‘‘I’’ are doubled, yields reliabilities of .73 for
‘‘P’’ and .53 for ‘‘I.’’ The reliabilities for the
‘‘P’’ and ‘‘I’’ categories are less than satisfac-
tory, but they compare favorably with the essay
material in the present study. The reliability
of .85 for the short verbal test indicates that
these items are performing here with the con-
sistency which other investigators have typically
found in verbal material.

3. Relationships Among Predictor Variables:
The intercorrelations in Table III are of value
in shedding some light on the possible compon-
ents of the abilities tested by each of the pre-
dictor measures. For this purpose, lack of
correlation is as significant as its presence.
For example, Paragraph-Revision has fairly
low and almost uniform correlations with all
other variables; it is apparently influenced by
factors somewhat unrelated to abilities tested by
the other measures. Furthermore, the correl-
ations between Paragraphs A and B are lower
than the correlation of each with certain other
variables.

7. Ledyard R. Tucker. ‘‘A Note on the Estimation of Test Reliability by the Kuder-Richardson
Formula (20),’’ Psychometrika, XIV (1949), pp. 117-119.

TABLE II

RELIABILITY INFORMATION ON PREDICTOR
VARIABLES

Estimated Reliabilities of Objective Measures

                              Estimated
Variable                      Reliability*

Objective English Test
   Section I                  .93
   ‘‘P’’ Score                .57
   ‘‘I’’ Score                .36
   ‘‘G’’ Score                .82
   ‘‘S’’ Score                .81
Verbal Test                   .85

Computed Reader Reliabilities for Essays and
Paragraph-Revision

                 Reader         No. Papers
Variable         Reliability    in Sample

Essay Total      .78
Essay I          .67
Essay II         .47
Essay III        .67
Paragraph A      .83
Paragraph B      .59

*These estimated reliabilities were computed
by Kuder-Richardson formula (21):

$$r_{tt} = \frac{n}{n-1}\cdot\frac{\sigma_t^{2} - M_t\left(1 - \dfrac{M_t}{n}\right)}{\sigma_t^{2}}$$

Reference: G. F. Kuder and M. W. Richardson.
‘‘The Theory of the Estimation of Test Reliabil-
ity,’’ Psychometrika, II (1937), pp. 151-160.
The relationships among the other principal
predictors are of interest. The Objective Eng-
lish has a higher correlation with Verbal (.64)
than with Essay Total (.60); however, when each
coefficient is corrected for attenuation of both
variables,8 the correlation with Essay Total be-
[...] rected for attenuation. Thus it appears that
these three principal predictors are largely meas-
[...] of the Essay Test would hamper its measur-
ing efficiency. There is some indication, how-
ever, that the Essay Test is measuring a lang-
uage ability in addition to verbal ability, since
[...] tionship with Objective English than with
Verbal.

The ‘‘P,’’ ‘‘I,’’ ‘‘G,’’ and ‘‘S’’ scores show
higher relationships with each other than with
other variables. The fact that each of these sub-
scores correlates more highly with the others
than with Verbal indicates that the Objective Eng-
lish Test is measuring some trait other than
verbal ability.
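
The correction for attenuation applied to the correlations of Objective English with Verbal and with Essay Total can be sketched as follows. The formula is the standard one (the observed correlation divided by the square root of the product of the two reliabilities). Which reliability figures were used for the correction is not stated in the surviving text, so the choices below (.93 and .85 from Table II, and the Spearman-Brown estimate of .68 for Essay Total) are assumptions, and the results are not the author's reported values.

```python
from math import sqrt


def correct_for_attenuation(r_xy, r_xx, r_yy):
    """Estimate the correlation between two measures free of
    measurement error, given their observed correlation r_xy and
    their reliabilities r_xx and r_yy."""
    return r_xy / sqrt(r_xx * r_yy)


# Assumed reliabilities: Objective English .93, Verbal .85,
# Essay Total .68 (the Spearman-Brown estimate, an assumption here).
objective_verbal = correct_for_attenuation(0.64, 0.93, 0.85)
objective_essay = correct_for_attenuation(0.60, 0.93, 0.68)

print(round(objective_verbal, 2), round(objective_essay, 2))
```

Under these assumed reliabilities the less reliable Essay Total gains more from the correction than Verbal does, which illustrates how a correction of this kind can reverse the order of two observed coefficients.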

4. Reliability of Instructors’ Ratings: The
instructors’ original placing of the students in a
simple rank-order continuum is shown in Table
IV to bear a high relationship to the results ob-
tained by the extremely painstaking paired-com-
parison method. The correlations range from
.84 to .99, with a median r of .97; all but two
are between .93 and .99. There is no evidence
that the preliminary ranking had any greater in-
fluence on the later results than the paired-com-
parisons had on one another. The instructors
did not know in advance that their original rank-
ings would be checked so meticulously, and they
had no way of referring back to them during the
paired-comparison session. Paired-comparisons
are ordinarily used with raters who are inexper-
ienced or resistant, neither of which conditions
obtained among this highly trained and coopera-
tive group of instructors. The demonstrated
consistency of the two methods with these raters
led to the conclusion that simple rank-order
might be used in Study II where somewhat less
interviewing time would be available.

5. Relationship Between Criterion Variables: 
In Table V the correlations between instructors’ 
ratings and course grades are seen to range 
from .44 to .91, with a median r of .85. De-
spite the attempt to ‘‘purify’’ the ratings, it is
not surprising that the correlations are this high 
since the course grades are all in English Com- 
position. The correlations are low enough, how- 
ever, to reward the attempt to separate the two 
types of criterion. It is noteworthy that the low 
correlations in Table V do not tend to be associ-
ated with the lower reliabilities reported in
Table IV.

6. Relative Validities of Predictor Variables:
The correlations, by class, of the predictor var- 
iables with instructors’ ratings and with course 
grades are presented in Tables VI-X inclusive. 
In view of the large between-class differences
in ability, the r’s cannot all be regarded as es- 
timates from the same population and thus can- 
not be averaged. However, a rough estimate of 
the relative efficacy of the various predictors 
can be made by computing the median r’s over 
all classes.9 Such a comparison should indicate 
the worthwhileness of continuing the investiga- 
tion. 
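
The median-r summary described above can be illustrated with the within-class correlations for course grades. The values below are the Objective English and Essay columns of Table VII as they can be read in this copy, so individual entries should be treated with some caution; the helper name `median_r` is an assumption made for the sketch.

```python
from statistics import median

# Within-class correlations with course grades, keyed by class number.
# Class 10 has no Objective English score and is excluded from the medians.
objective_r = {
    "01": 0.53, "02": 0.60, "03": 0.70, "04": 0.61, "05": 0.13,
    "06": 0.55, "07": 0.17, "08": 0.34, "09": 0.28, "11": 0.19,
    "12": 0.15, "13": 0.26, "14": 0.41, "15": 0.30, "16": 0.37,
}
essay_r = {
    "01": 0.60, "02": 0.57, "03": 0.76, "04": 0.38, "05": 0.29,
    "06": 0.24, "07": 0.63, "08": 0.46, "09": 0.57, "10": 0.33,
    "11": 0.16, "12": 0.45, "13": 0.36, "14": 0.49, "15": 0.31,
    "16": 0.56,
}


def median_r(correlations, excluded=()):
    """Median of the class correlations, skipping excluded classes."""
    kept = [r for cls, r in correlations.items() if cls not in excluded]
    return median(kept)


median_objective = median_r(objective_r, excluded={"10"})
median_essay = median_r(essay_r, excluded={"10"})
print(median_objective, median_essay)
```

The median is used rather than the mean because, as the text notes, the class correlations cannot be regarded as estimates from a single population.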

Data for the four principal predictors appear 
in Tables VI and VII. In predicting instructors’ 
ratings the Objective English Test and the Verb-
al Test appear to be best with median r’s of .43;
the Essay Test is next with a median r of .34;
Paragraph-Revision is lowest with a median r
of .27. With course grades as a criterion the
Essay Test is highest with a median r of .46;
the Verbal Test has a median r of .38; the Ob-
jective English Test is close to Verbal with a
median r of .34; Paragraph-Revision remains
lowest with a median r of .23.

With respect to the criterion of instructors’ 
ratings, the effectiveness of the individual pre- 
dictors may be further analyzed on the basis of 
the frequency with which they appeared to be 
superior. While the median validity for the Ob- 
jective English Test is higher than for the Essay 
Test (.43 as compared with .34), the Objective
English Test nevertheless showed higher within- 
class correlations in only 8 of the 15 classes 
while the Essay Test was superior in 7 classes. 
This would indicate that the superiority of the 
Objective English Test may not be stable. How-
ever, the Verbal Test, also with a median val-
idity of .43, showed itself superior to the Essay
Test in 10 of the 15 classes and superior to the
Objective Test in 8 of the 15 classes (equal in
one class). Thus there is a strong indication
that the Verbal Test may be the most dependable
predictor of instructors’ ratings.

TABLE VI

CORRELATIONS OF PRINCIPAL PREDICTORS WITH INSTRUCTORS’ RATINGS (BY CLASS)

[Table body not legible. Column headings, each with r and (N)*: Objective English Section I; Essay Questions; Paragraph-Revision; Verbal Test.]

Median r**   .43

*(N) = Number of cases
**Excluding Class 10

TABLE VII

CORRELATIONS OF PRINCIPAL PREDICTORS WITH COURSE GRADES (BY CLASS)

         Objective English   Essay          Paragraph-     Verbal
Class    Section I           Questions      Revision       Test
No.      r      (N)*         r      (N)     r      (N)

01       .53    25           .60    25      .15    25
02       .60    26           .57    26      .34    26
03       .70    33           .76    32      .23    33
04       .61    24           .38    23      .13    23
05       .13    23           .29    24      .24    24
06       .55    22           .24    19      .58    20
07       .17    22           .63    20      .02
08       .34    23           .46    23      .33
09       .28    23           .57    23      .06
10       --     --           .33    24      .32
11       .19    29           .16    25      .09
12       .15    26           .45    26      .35
13       .26    30           .36    29      .37
14       .41    22           .49    22      .65
15       .30    23           .31    23      .58
16       .37    23           .56    23      .03

Median
r**      .34

*(N) = Number of cases
**Excluding Class 10

TABLE VIII

CORRELATIONS OF OBJECTIVE ENGLISH TEST SUBSCORES WITH INSTRUCTORS’
RATINGS AND COURSE GRADES (BY CLASS)

[Table body largely not legible. Column headings: Correlations with Instructors’ Ratings; Correlations with Course Grades. Surviving entries under an ‘‘S’’ heading: .53, .59, .73, .55, .07, .58, .16.]

Median r***   .29   .22   .49   .36   .25   .21   .33

* P = Punctuation; I = Idiom; G = Grammar; S = Sentence Structure
** (N) = Number of cases
*** Excluding Classes 09 and 10

TABLE X

CORRELATIONS OF INDIVIDUAL PARAGRAPH-REVISION QUESTIONS
WITH INSTRUCTORS’ RATINGS AND COURSE GRADES (BY CLASS)

[Table body not legible. Column headings: Correlations with Instructors’ Ratings; Correlations with Course Grades.]

Median r*   .25

A = Paragraph A
B = Paragraph B
(N) = Number of cases
*Excluding Classes 06 and 08

On the other hand, with respect to course 
grades, the Verbal Test drops to third place in 
predictive value. The Essay Test surpasses 
both the Objective English Test and the Verbal 
Test in 10 of the 15 classes, while the Objective 
English Test surpasses the Verbal Test in 9 of 
the 15 classes (equal in one class). Thus the 
Essay Test appears to be the most dependable 
predictor of course grades although it is infer- 
ior in predicting instructors’ ratings. This dis- 
crepancy in the validities of the Essay Test may 
be an indication that it has a language-ability 
component in addition to the verbal component. 
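
The class-by-class comparisons used in the last two paragraphs amount to counting, over classes, how often one predictor's within-class correlation exceeds another's. The sketch below applies that count to the Objective English and Essay correlations with course grades as they can be read from Table VII in this copy (Class 10 omitted). The Verbal column is not legible here, so the three-way comparison reported in the text cannot be reproduced; the function name `count_superior` is an assumption.

```python
def count_superior(first, second):
    """Return (first_higher, second_higher, ties) over paired classes."""
    first_higher = sum(1 for a, b in zip(first, second) if a > b)
    second_higher = sum(1 for a, b in zip(first, second) if a < b)
    ties = len(first) - first_higher - second_higher
    return first_higher, second_higher, ties


# Correlations with course grades, Classes 01-09 and 11-16 in order.
objective = [0.53, 0.60, 0.70, 0.61, 0.13, 0.55, 0.17, 0.34, 0.28,
             0.19, 0.15, 0.26, 0.41, 0.30, 0.37]
essay = [0.60, 0.57, 0.76, 0.38, 0.29, 0.24, 0.63, 0.46, 0.57,
         0.16, 0.45, 0.36, 0.49, 0.31, 0.56]

essay_higher, objective_higher, ties = count_superior(essay, objective)
print(essay_higher, objective_higher, ties)
```

A count of this kind says nothing about the size of the differences, which is why the text treats it as a supplement to the median r's rather than a replacement for them.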

It is possible that in this study many students 
who received the benefit of a high rating on the 
basis of their best work were unable to main- 
tain the same rank in class on the basis of their 
over-all performance. In this case, the Verbal 
Test would appear best in estimating the highest 
capabilities of each student, while the Essay Test 
is more closely related to the students’ average 
performances in a freshman composition course. 
It might be noted that freshmen are typically re- 
quired to write on a number of subjects uncon- 
genial to them, whereas more advanced students 
have greater opportunity for personal choice. 
The two major criteria, with different underly- 
ing assumptions, may both be desirable depend- 
ing upon the situation in which they are used. 

The relative values of the subscores for each 
predictor variable are also of interest. The 
data in Table VIII show that for both criteria the 
grammar and sentence-structure items were the 
most valid items in the Objective English Test. 
In fact, grammar alone did as well as the total 
test, and sentence-structure did almost as well. 

Of the three essay questions, question III had 
the highest median validity for both criteria, and 
question I was equally valid in predicting course 
grades (Table IX). However, the data illustrate 
the variability in validity of essay questions de- 
signed to measure the same thing. 

The median r’s for Paragraphs A and B, shown
in Table X, were computed after excluding Class- 
es 06 and 08 because so few students in those 
classes answered Paragraph B. Apparently there 
is little difference in validity between the two 
paragraphs. 


(d) Conclusions

1. The reliabilities of the Objective English
Test (.93) and the Verbal Test (.85 for 30 items)
are sufficiently high for satisfactory measure-
ment of college freshmen. In actual practice,
the Verbal Test would be longer and hence more
reliable.

2. The reader reliability for the Essay Test 
(.78 for three essays combined) is too low for 
satisfactory measurement. The test reliability 
for Essay Total was estimated at .68.

3. The Paragraph-Revision Test showed a 
high reader reliability for Paragraph A (.83) but
a lower one for Paragraph B (.59). The data 
indicate that there is a potentiality, at least, for 
satisfactory reader reliability in Paragraph- 
Revision. 

4. The relationship of Paragraph-Revision 
with all other variables is low and indetermin- 
ate. 

5. The Objective English Test, the Verbal 
Test, and the Essay Test are measuring the verb- 
al factor chiefly. In addition, the Objective Eng- 
lish Test and the Essay Test may have another 
element in common, presumably achievement
in the handling of language. 

6. Rank-order ratings by instructors show 
sufficiently high reliability (median r of .97 with 
paired-comparisons) to justify using them in 
Study II. 

7. Instructors’ ratings and course grades are
highly correlated (median r of .85).

8. Except for the fact that Paragraph-Revis- 
ion shows consistently the lowest validities, no 
definite conclusions can be made regarding the 
relative validities of the principal predictors. 
The Verbal Test showed the closest relationship 
to instructors’ ratings while the Essay Test was 
the best predictor of freshman English grades. 

9. The grammar and sentence-structure it-
ems in the Objective English Test showed high-
er reliabilities and validities than did the punc-
tuation and idiomatic-expression items. 

10. Essay III appeared to be the most valid 
of the essay questions, and was one of the two 
most reliable of the essay questions. 

11. A continuation of the investigation is need- 
ed, in which a larger group of students may be 
used and in which criterion data may be made 
comparable over all classes. 


E. Study II 
(a) The Test Population 


Forty-four groups of secondary-school sen-
iors were identified who took the College Entrance
Examination Board’s April examination in English 
Composition and whose teachers were in Prince- 
ton as members of the corps of readers for that 
examination. All possible groups of at least 13 
students, whose teachers were in Princeton, 
were utilized. The groups ranged in size from 
13 to 26 students and represented 30 schools (one 
teacher from each school); there were 21 private
schools with 33 groups of students and 9 public
schools with 11 groups. The selective factors 
operating in favor of the private schools were: 
(1) certain private schools tend to train more 
C. E. E. B. candidates per school than do public 
schools; (2) private schools are more interested 
than are public schools in releasing their teach- 
ers for duty in Princeton as readers. 

It was possible to obtain complete data on a
total of 420 students, and data for all variables
except the Verbal Test on 763 students. In order
that no data might be lost, separate intercorrel- 
ation tables were prepared for the total group of 
763 students and for the subgroup of 420. 


(b) Description of Variables 


A one-hour examination in English Composi- 
tion was administered which consisted of three 
parts: 


(1) Objective English: On the basis of item-
analysis results obtained in Study I, 45 items
were selected to make a well-rounded test.10
Since the grammar and sentence-structure ma-
terial showed the highest reliability and validity
in Study I, these types were emphasized in Study
II. The 45 items were distributed as follows:


6 punctuation items 

7 idiomatic-expression items 
16 grammar items 

16 sentence-structure items 


(2) Essay: In view of its superior performance
in Study I, essay question III was chosen
for Study II. There was a slight change in that
part of the wording which involved instructions
to write only one paragraph and to use about 150
words. The new wording was:


In a single paragraph discuss a serious 
error which you think parents may make 
in the rearing of a child, and indicate 
the possible consequences of this error. 
Use your own experience or your obser- 
vation of friends and acquaintances to 
support your opinion. (A satisfactory 
discussion can be presented in a paragraph
of about 150 words.)

Two scores on the essay were obtained: (1) content
and organization; (2) style (including spelling,
punctuation, syntax, vocabulary, and sentence
structure).


10. For sample items see Appendix I. 


11. For illustrative items see Appendix II.
(3) Paragraph-Revision: The paragraph-revision
material from Study I was reproduced
here without change. 

Many of the students who took the English 
Composition Test also took the Scholastic Apti- 
tude Test, thus providing as a fourth predictor: 


(4) Score on Verbal Sections of the Scholas- 
tic Aptitude Test: The verbal material 11 occu- 
pied 100 minutes of testing time, with four sep- 
arately timed parts as follows: 


Part 1—30 minutes; 65 items 
Three item-types: 
Analogies, 25 items 
Antonyms, 20 items 
Sentence completion, 20 items 
Items are arranged in groups of 5, and item- 
types are rotated (e.g., 5 analogies, 5 anto- 
nyms, 5 sentence completions, etc. ) 


Part 2—30 minutes; 30 items 
Reading-comprehension material consisting 
of 5 paragraphs each followed by questions 
based on its content. 


Part 3—25 minutes; 65 items 
Two item-types: 
Antonyms, 40 items 
Analogies, 25 items 
Groups of 8 antonyms are alternated with 
groups of 5 analogies. 


Part 4—15 minutes; 10 items 
Two reading-comprehension paragraphs each 
followed by questions based on its content.


The total scores on the Verbal Test were con- 
verted to a standard scale with a mean of 500 
and a standard deviation of 100. For the entire 
population taking the test at this series, the 
mean was 495 and the standard deviation 107. 
For convenience in working with the statistics 
for the experimental group the standard scores 
were divided by 10. 

The fifth and sixth major variables were cri- 
terion variables, as described below: 


(5) Instructors’ Ratings: As in Study I, the 
major criterion consisted of ratings by the teach- 
ers of their students’ ability to write. Before 
coming to Princeton the teachers were told that 
they would be asked to make some ratings of 
their students’ writing ability and it was suggested
that they familiarize themselves thoroughly with the
work of those students to be rated. The teachers had
no advance knowledge,
however, of the manner in which the rating was 
to be done. Most of the teachers were already 
so intimately acquainted with their students’
performance that little review was required; on 
the whole it appeared that these teachers knew 
their students better than the college instructors 
in Study I had known theirs. During the first 
two days of the reading period, the teachers 
were taken in groups of two or three to a small
quiet room in which the rating procedure could 
be discussed at leisure. Each group discussion 
progressed to the point where each teacher indi- 
cated an understanding of the principles evolved 
in Study I. 12 Then each teacher was given a set
of cards for each group of his students and was 
asked to arrange the cards in rank order. The 
teachers had unlimited time and a quiet place 
for their work. Upon completion of the ratings 
the teachers were thanked and were left with the 
impression that their participation in the study 
was finished. At the end of the week they were
asked to do the ratings again in order to insure 
the maximum degree of accuracy; it was explain- 
ed to them that their ratings on the two occas- 
ions would be averaged, but that they should 
have no concern whatever for aiming at consis- 
tency. Since there was no reward associated 
with consistency, and since the teachers’ atti- 
tude was entirely that of desiring to be helpful 
in a study they considered significant, it is the 
writer’s belief that the raters achieved a high 
standard of carefulness and conscientiousness. 

The two ratings for each student were then 
averaged to determine the rating to be used in 
the study. Reliability was estimated by cor- 
relating the first sets of ratings with the second; 
however, it is to be expected that the combina- 
tion of both ratings would be more accurate than 
either set of ratings alone. All ratings were 
finally recorded in terms of percentile ranks 
within groups.

(6) English Course Grades: As a secondary 
criterion, final semester grades in English for 
semesters 5, 6, 7, and 8 were averaged for each 
student who had been in attendance at the same 
school for that length of time. Most of the stud- 
ents were in this category. For students who 
had transferred from other schools, only the
grades obtained at the school currently attended 
were used. The course grades were typical in 
that they represented the students’ success in
all the types of performance evaluated in sec- 
ondary school English courses and not composi- 
tion alone. It was felt that averaging the four 
grades would minimize the effect of unreliabil- 
ity in grading, and that omitting grades from 
other schools would rule out the effects of dif- 
ferences between schools. It is recognized, of 
course, that teachers in the same school differ 
in their marking standards, and no claim is made 
for this criterion as a highly reliable measure. 
However, since composition is an important part 
of English course work, and since success in a
composition test is often used for guidance and
placement, it is of interest to observe the extent
to which the predictor variables and the ratings
are related to course achievement.


(c) Results 


(1) Reliability of Predictor Variables: In this
study the reader reliabilities were computed by 
a method which yielded results based upon a 
large number of readings of each paper, rather 
than on two per paper as in Study I. In order 
that each reader assigned to a question might 
have an individual copy of each paper in the 
sample, the papers were reproduced photographically. 13
The papers were selected with the intention of
including answers which varied in quality, a
selective factor which would be expected to
produce a higher reliability than if the
papers had been chosen to be representative of 
the entire group of candidates. 

For each student’s paper, the approximate 
‘‘true score’’ was defined as the average score 
assigned it by all readers who read that paper. 
Thus, an individual reader was ‘‘reliable’’ to 
the extent that his scores corresponded with the 
‘‘true scores.’’ Complete data were available
for 39 readers of the essay question and for 30 
readers (two groups of 15 each) of the paragraph- 
revision section. The ‘‘true scores’’ then, were 
based on 39 observations in the case of the essay
and 15 observations in each of the analyses
of the paragraph-revision. 

The formula for the reader reliabilities was 
based on the assumption that the obtaining of 
‘‘true scores’’ makes it possible to relate these
scores to the formula for the index of reliability. 
Thus, for an individual reader, the correlation 
between the scores he gives and the true scores 





12. Professor Elledge's performance of a role similar to that he played in Study I was of great
value in establishing continuity.


13. An unpublished study previously conducted by the College Entrance Examination Board had 
shown that insofar as means, standard deviations and intercorrelations are concerned, the 
original blue books and the copies are equivalent.





192 JOURNAL OF EXPERIMENTAL EDUCATION (Vol. XXII 


($r_{xt}$) is defined as the square root of his
reliability ($\sqrt{r_{xx}}$). The reader’s reliability,
then, would equal $r_{xt}^2$. The formula was developed
as follows: 14

$$r_{xt}^2 = \frac{\left(\sum\sum xt\right)^2}{\left(\sum\sum x^2\right)\left(n\sum t^2\right)} \qquad \text{(over-all reliability for } n \text{ readers)}$$
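The reader-reliability computation can be sketched in a few lines. This is a minimal illustration, not the routine actually used at ETS: it assumes a complete readers-by-papers score matrix, takes each paper's approximate "true score" to be the mean over all readers, and treats x and t as deviations from their own means. The function name and the sample scores are hypothetical.

```python
import numpy as np

def reader_reliabilities(scores):
    """scores: 2-D array, one row per reader, one column per paper."""
    scores = np.asarray(scores, dtype=float)
    n_readers = scores.shape[0]

    # Approximate "true score" of each paper: average over all readers.
    true = scores.mean(axis=0)
    t = true - true.mean()                                # deviation true scores
    x = scores - scores.mean(axis=1, keepdims=True)       # deviation scores per reader

    sum_xt = x @ t                     # one cross-product sum per reader
    sum_xx = (x ** 2).sum(axis=1)      # one sum of squares per reader
    sum_tt = (t ** 2).sum()

    # Reliability of each reader: squared correlation with the true scores.
    per_reader = sum_xt ** 2 / (sum_xx * sum_tt)
    # Pooled over-all reliability for n readers.
    overall = sum_xt.sum() ** 2 / (sum_xx.sum() * n_readers * sum_tt)
    return per_reader, overall

# Hypothetical scores: three readers, four papers.
per_reader, overall = reader_reliabilities([[1, 2, 3, 4],
                                            [2, 1, 4, 3],
                                            [1, 3, 2, 4]])
```

Because the true scores include the reader's own marks, the per-reader values are somewhat inflated when the number of readers is small; with 39 or 15 readers, as here, that effect is minor.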


The reliabilities for the Objective English 
section and for the Verbal score were computed 
by correlating scores on the odd-numbered items
with scores on the even-numbered items
and correcting by the Spearman-Brown prophecy
formula. For the Objective English section the
reliability is not affected by speededness since 
everyone had an opportunity to answer all items. 
The Verbal reliability was somewhat affected
since only 41% of the students completed the test.
Nevertheless, there is no doubt that a relatively
unspeeded verbal test can attain a reliability of
.90 or higher. For example, data from the files 
of the Educational Testing Service show that in 
one of its confidential testing programs for college
applicants a 50-item verbal test consisting
of analogies and sentence completions was completed
by 93% of the students and yielded a reliability
of .88 as computed by Kuder-Richardson (20).
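The two reliability estimates named in this paragraph can be illustrated as follows. This is a sketch under stated assumptions: items are scored 0 or 1, every examinee attempts every item, and the small response matrix is invented for the example. The function names are hypothetical.

```python
import numpy as np

def odd_even_reliability(item_scores):
    """Correlate odd-item and even-item half scores, then step up to full length."""
    items = np.asarray(item_scores, dtype=float)
    odd = items[:, 0::2].sum(axis=1)     # items 1, 3, 5, ...
    even = items[:, 1::2].sum(axis=1)    # items 2, 4, 6, ...
    r_half = np.corrcoef(odd, even)[0, 1]
    return 2.0 * r_half / (1.0 + r_half)  # Spearman-Brown for doubled length

def kr20(item_scores):
    """Kuder-Richardson formula 20 for items scored 0/1."""
    items = np.asarray(item_scores, dtype=float)
    k = items.shape[1]
    p = items.mean(axis=0)                 # proportion passing each item
    total_var = items.sum(axis=1).var()    # variance of total scores
    return k / (k - 1.0) * (1.0 - (p * (1.0 - p)).sum() / total_var)

# Invented data: five examinees, four items.
responses = [[1, 1, 1, 1],
             [1, 1, 1, 0],
             [1, 1, 0, 0],
             [1, 0, 0, 0],
             [0, 0, 0, 0]]
split_half = odd_even_reliability(responses)
kuder_richardson = kr20(responses)
```

On a speeded test the odd-even estimate tends to run high, since unreached items are missed in both halves alike; this is the reason the Verbal figure is treated with some caution.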

All the reliabilities are reported in Table 
XI. These reliabilities are based on papers 
drawn from the entire test population, not from 
the experimental group alone. For the objective
measures, spaced samples were taken in such 
a way as to be representative of the total group. 

The highest reliability, .96, was attained by
the Verbal Test. The Objective English section,
which occupied only about 20 minutes of testing
time, attained a reliability of .78. If the entire
60 minutes of English Composition had been devoted
to objective questions the reliability, as
estimated by the Spearman-Brown prophecy
formula, would have risen to .91.

The Paragraph-Revision demonstrated high 
reader reliabilities, ranging from .69 to .84. 
Paragraph A was the better of the two paragraphs, 
showing for two groups of readers reliabilities 
of .81 and .84; this is approximately equivalent
to the reader reliability of .83 obtained for Par- 
agraph A in Study I. The reader reliability for 
Paragraph B rose from .59 in Study I to .69 and
.72 in Study II. It may be concluded that a par- 
agraph-revision test may achieve a highly sat- 
isfactory reader reliability, particularly if an 
entire hour were devoted to it. However, the 
test reliability remains unknown. The correla- 
tion between Paragraph A and Paragraph B was 
.37, only slightly higher than that obtained in 
Study I. If this correlation is taken as an esti- 
mate of test reliability, then application of the 
Spearman-Brown prophecy formula would yield 
an estimated test reliability of .54; if the test 
were extended to three times its length, the es- 
timated reliability would become .78. The na- 
ture of the material itself led the English exam- 
iners to believe that it would be possible to de- 
velop questions which would show higher inter- 
correlations, thus yielding higher estimated re- 
liabilities. 

The reader reliability for the essay question 
is a little lower than that obtained in Study I, .62 
as compared with .67. We have no essay inter- 
correlations in Study II by which to estimate test 
reliability, but the estimates in Study I of .41
for a twenty-minute question and .68 for a test
containing three such questions remain reason- 
able. 
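The Spearman-Brown estimates quoted in this section can be checked with a short calculation. The sketch below uses the standard prophecy formula; the function and variable names are illustrative only, and the inputs are the coefficients reported in the text.

```python
def spearman_brown(r, k):
    """Estimated reliability of a test lengthened k times, given reliability r."""
    return k * r / (1.0 + (k - 1.0) * r)

# Objective English: .78 for about 20 minutes, extended to the full 60 minutes.
objective_full_hour = spearman_brown(0.78, 3)

# Paragraph-Revision: A-B correlation of .37 stepped up to the two-paragraph test,
# then that test extended to three times its length.
paragraph_test = spearman_brown(0.37, 2)
paragraph_tripled = spearman_brown(paragraph_test, 3)

# Essay: .41 for one twenty-minute question, extended to three such questions.
essay_three_questions = spearman_brown(0.41, 3)

for label, value in [("Objective English, 60 min", objective_full_hour),
                     ("Paragraph-Revision, 2 paragraphs", paragraph_test),
                     ("Paragraph-Revision, tripled", paragraph_tripled),
                     ("Essay, 3 questions", essay_three_questions)]:
    print(f"{label}: {value:.2f}")
```

Rounded to two places these agree with the figures given above (.91, .54, .78, and .68).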


(2) Reliability of Instructors’ Ratings: The 
correlations between the first and second sets 
of ratings by each instructor were computed to 
give an estimate of the reliability of the ratings. 
The Spearman ranks formula was used: 


$$\rho = 1 - \frac{6\sum D^2}{N(N^2 - 1)}$$
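As a worked illustration of the ranks formula, the sketch below computes the coefficient for one class from two rank orderings. It assumes untied ranks from 1 to N; the function name and the example class of 13 students are hypothetical.

```python
def spearman_rho(first_ranks, second_ranks):
    """Spearman rank correlation for two untied rankings of the same N students."""
    n = len(first_ranks)
    if n != len(second_ranks) or n < 2:
        raise ValueError("both rankings must cover the same N >= 2 students")
    # D is the difference between a student's two ranks.
    sum_d2 = sum((a - b) ** 2 for a, b in zip(first_ranks, second_ranks))
    return 1.0 - 6.0 * sum_d2 / (n * (n * n - 1))

# Hypothetical class of 13: the second rating swaps the students ranked 4 and 5.
first = list(range(1, 14))
second = [1, 2, 3, 5, 4, 6, 7, 8, 9, 10, 11, 12, 13]
rho = spearman_rho(first, second)
```

A single adjacent swap in a class of this size still leaves the coefficient above .99, which gives some sense of how little disagreement the reported range of .86 to .999 represents.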


The results are presented in Table XII, showing
a range of .86 to .999 with a median of .97.
Since the two ratings for each student were av- 
eraged to obtain the rating used in the study, 
the reliabilities of the averaged ratings should 
be higher than those indicated in Table XII. It 
is recognized, of course, that the instructor’s 
memory of his previous ratings must have been 
operative to some extent, although every effort 
was made to minimize this effect. The writer 
regards the obtained ratings as highly reliable
in that they represent considered and consistent





14. This treatment was devised by Dr. Tucker and is
one of the routine methods used at ETS.
TABLE XI

RELIABILITY INFORMATION ON PREDICTOR VARIABLES

Reliabilities of Objective Measures (odd-even correlations
corrected by Spearman-Brown prophecy formula)

                                      No. Papers
Variable             Reliability      in Sample

Objective English       .78              500
S.A.T. Verbal           .96              500

Reader Reliabilities for Essay and Paragraph-Revision

                  Reader         No. of      No. Papers
Variable          Reliability    Readers     in Sample

Essay                .62           39           38
Paragraph A*         .84           15           39
Paragraph A*         .81           15           40
Paragraph B*         .69           15           38
Paragraph B*         .72           15           38

*Two groups of readers were used for each of the paragraphs
in the Paragraph-Revision section.


TABLE XII

CORRELATIONS BETWEEN FIRST AND SECOND
RATINGS BY INSTRUCTORS IN THE FORTY-FOUR CLASSES
(Computed by the Spearman ranks formula)

Relia-    No. of      Relia-    No. of      Relia-    No. of
bility    Classes     bility    Classes     bility    Classes

.999                  .95         2         .90         1
.99                   .94         1         .89         0
.98                   .93                   .88         1
.97                   .92                   .87         0
.96                   .91                   .86         1

Median ρ = .97
estimates of the students’ ability.

(3) Combination of Data: The key problem in 
a study which attempts to use criterion data from 
different sources is to combine the criterion 
measures in such a way that they may be regarded
as being on a common scale. In the present
investigation it is obvious that course grades 
do not have the same meaning from class to 
class and that equivalent percentile ranks (based 
on teachers’ rank-order ratings) likewise are 
not indicative of equal ability. A student with a
low rank in an able class actually may have more
ability than students with higher ranks in less
able classes.

To solve this problem a multiple-regression 
adjustment was utilized. 15 Since each of the 
predictor variables places students from all 
classes on a common scale, it is possible to develop
multiple-regression equations which will
determine a corrected criterion score for stud- 
ents in each class. These corrected criterion 
scores thus take into account between-class
differences in ability insofar as these differences
can be indicated by the test scores. The three 
separate steps in procedure, culminating in the 
computation of the adjustments in Analysis III,
are described below. 


Analysis I: Using raw data, with no adjustment
for class differences, a table of intercorrelations
was computed. Multiple correlations were
also reported, together with regression weights.
The correlations with criteria should be spuriously
low because of the noncomparability of
criterion scores. (Tables XIII, XVI, XIX, XX)


Analysis II: After expressing each score on
each variable in terms of its deviation from its
own class mean, the second over-all tables of
intercorrelations with multiple correlations and
regression weights were computed. (Tables XIV,
XVII, XIX, XX) Since, for each class, the
ranges of scores on predictor variables were
thus reduced, the intercorrelations among predictor
variables were correspondingly depressed.
However, correlations involving the criterion
variables were increased because the criterion
measures have much more validity within classes
than when pooled together as in Analysis I.


Analysis III: Corrected criterion measures were
obtained by using the raw-score regression weights
from Analysis II. The corrected criterion scores
were obtained by adding corrections to the original
criterion scores, there being one additive
correction for all students in each class.


If a single predictor were involved, the corrected
mean criterion score for class 1 (Corr. $\bar{Y}_1$)
could be expressed as follows:

$$\text{Corr. } \bar{Y}_1 = \bar{Y} + b_{yx}\,(\bar{X}_1 - \bar{X})$$

($\bar{Y}$ is the mean criterion score for all classes;
$b_{yx}$ is the raw-score regression of $Y$ on $X$;
$\bar{X}_1$ is the mean predictor score for class 1;
and $\bar{X}$ is the mean predictor score for all
classes.)


The adjustment constant to be added to each
score in class 1 would then be the difference
between the obtained mean for the class and the
corrected mean: 16

$$A_1 = \text{Corr. } \bar{Y}_1 - \bar{Y}_1 = \bar{Y} - \bar{Y}_1 + b_{yx}\,(\bar{X}_1 - \bar{X})$$

The adjustment constant based on the raw-score
regression weights computed in Analysis II
for all predictor variables ($j$) becomes:

$$A_1 = \bar{Y} - \bar{Y}_1 + \sum_j \bar{X}_{1j}\, b_{yj} - \sum_j \bar{X}_j\, b_{yj}$$


The adjustment constants thus computed for each 
class were added to each criterion score; both 
ratings and course grades were adjusted by this 
method. A new table of intercorrelations was 
then constructed using the new adjusted scores 
for the criterion variables. Multiple correla- 
tions and regression weights were again com- 
puted. (Tables XV, XVIII, XIX, XX) 
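The adjustment procedure lends itself to a compact sketch. The code below is one reading of the three analyses, not a reproduction of the original computations: it assumes the Analysis II weights are ordinary least-squares weights fitted to deviations from class means, and that the over-all means are taken across all students. The function name and the toy data are hypothetical.

```python
import numpy as np

def class_adjustment_constants(X, y, classes):
    """Return within-class regression weights, one additive constant per class,
    and the adjusted criterion scores."""
    X = np.asarray(X, dtype=float)      # shape (students, predictors)
    y = np.asarray(y, dtype=float)      # criterion score per student
    classes = np.asarray(classes)       # class label per student
    labels = np.unique(classes).tolist()

    # Analysis II step: express every score as a deviation from its class mean.
    Xd, yd = X.copy(), y.copy()
    for c in labels:
        m = classes == c
        Xd[m] -= X[m].mean(axis=0)
        yd[m] -= y[m].mean()

    # Raw-score regression weights from the pooled within-class deviations.
    b, *_ = np.linalg.lstsq(Xd, yd, rcond=None)

    # Analysis III step: constant = over-all criterion mean - class criterion mean
    #                    + weighted (class predictor means - over-all predictor means).
    grand_y, grand_x = y.mean(), X.mean(axis=0)
    constants = {}
    for c in labels:
        m = classes == c
        constants[c] = float(grand_y - y[m].mean()
                             + (X[m].mean(axis=0) - grand_x) @ b)
    y_adjusted = y + np.array([constants[c] for c in classes.tolist()])
    return b, constants, y_adjusted

# Toy data: two classes rated on the same within-class scale,
# although class 1 scores higher on the single predictor.
X = [[1], [2], [3], [4], [5], [6]]
y = [10, 12, 14, 10, 12, 14]
classes = [0, 0, 0, 1, 1, 1]
weights, constants, y_adjusted = class_adjustment_constants(X, y, classes)
```

In the toy data the abler class receives a positive constant and the weaker class a negative one, so that equal within-class standings no longer imply equal adjusted criterion scores.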


As a result of this type of adjustment, validities
would tend to be increased because of the in- 
crease in range of criterion scores. Intercor- 
relations among the predictor variables are 
completely unaffected by this adjustment of the 
criterion variables. 

The adjusted validities may be slight overestimates,
with the true validities lying somewhere
between those shown in Analysis I and
those shown in Analysis III. Comparisons will
show, however, that the increases in validities
in Analysis III tend to be quite small.





15. Dr. Tucker was responsible for developing this technique for use at the Educational Testing
Service.


16. A basic formula of this type is given in: S. S. Wilks, Elementary Statistical Analysis
(Princeton, New Jersey: Princeton University Press, 1948), p. 251.
(4) Interrelationships Among Variables: In
Tables XIII-XVIII are presented the intercorrelations
among all the variables studied. Tables
XIII-XV represent the 420 cases who took the 
Verbal test as well as the English test. The entire
group of 763 cases is represented in
Tables XVI-XVIII, which do not include the
Verbal test. For an analysis of the results of 
the study, only Tables XIII-XV are needed; the 
additional tables were computed to determine 
whether any significant information was being 
lost by the enforced dropping of the 343 students 
who did not take the Verbal test. Comparison 
of the two sets of tables will indicate that the in- 
tercorrelations for the two groups are approxi- 
mately the same. 

The means and standard deviations reported 
in Tables XIII-XVIII reflect the fact that the 
three parts of the English test were weighted in 
computing the total English score. The Objective
English and the Essay each received a weight
of 3, while the Paragraph-Revision was weighted 1.
This weighting was done early in the study
on the basis of an observed range of scores on
a sample of 100 papers, and was intended to 
equalize the contribution of the three parts in 
the reported scores. The full data indicate that 
this system tended to weight the Objective Eng- 
lish somewhat too heavily; however, this weight- 
ing does not affect any of the intercorrelations 
except those involving the total English score. 

For the following discussion of the interrela- 
tionships among variables, Table XV will be 
employed since it includes all variables and represents
the analysis for adjusted criterion scores.


Paragraph-Revision continues to show the
lowest correlations with other variables, although
the correlation between Paragraphs A
and B rose from .33 in Study I to .37 in Study II.
However, unlike the finding in Study I, Paragraphs
A and B correlate higher with each other
than with any other variable.

Objective English continues to show a higher
correlation with Verbal (.61) than with Essay
(.33); however, the correlation of Objective English
with this particular essay question (Essay
III in Study I) was higher (.47) in Study I. In the
present study the correlation of Verbal with Essay
is .39, as compared to a correlation of .36
with Essay III in Study I. When the present
intercorrelations are corrected for attenuation,
using an estimated reliability of .41 for the essay
question, no difference appears in the order
of relationship. The corrected correlation of
Essay with Verbal is .62. This would corroborate
the indication in Study I that these three
variables are measuring the verbal factor to a
large extent. However, contrary to Study I, the
Essay score now shows a higher relationship
with Verbal than with Objective English, so that
the supposition that the Essay measures a language
ability in addition to verbal ability is not
supported here.

Surprisingly enough, the correlation between 
the two criterion variables is very high, .82. 
While this approximates the median intercorrelation
of .85 found in Study I, it would have
appeared to the writer that the criteria would 
be more independent of each other in the pres- 
ent study where average grades over two years 
of work are employed and where many of the 
course grades had been assigned by instructors 
not participating in the study. The two criteria 
are, moreover, consistently comparable with 
each other in their relationships to other vari- 
ables. While it seems on this basis that the 
painstaking process of obtaining ratings might 
have been dispensed with, it is nevertheless re- 
warding to note that a highly consistent criter- 
ion of the students’ ability has been attained. 
Composition is the most heavily emphasized 
aspect of secondary-school English courses,
and thus must be regarded as a strong component
in course grades. It is to be expected that
some criterion unreliability is present, but the
correlations of .76 and .77 with another variable
(Verbal) indicate that the criteria in this
study are much more reliable than is ordinarily 
thought possible either for school grades or for 
subjective ratings. 

There are wide differences in validity among 
the predictor variables. The Verbal test is outstanding,
with validities of .76 and .77. The
Objective English is second best, with validities
of .58 and .60. The other validities are: Essay,
.40 and .41; Paragraph-Revision, .35 and .37.
The ‘‘style’’ scores in the Essay question showed
higher validities (.39 and .39) than did ‘‘content’’
(.22 and .26). Paragraph B (.33 and .33)
was more predictive than was Paragraph A (.26
and .29). The total English test (.63 and .65)
was more predictive than any of its parts, but 
less predictive than the Verbal. 

It is difficult to apply the correction for at- 
tenuation to these validities since the reliabil- 
ities of the criteria are unknown and since some 
inaccuracy is already introduced by the neces- 
sity for estimating the reliability of the Essay 
question. However, it is of some interest to 
make the best estimate possible for the criter- 
ion reliabilities and to see if there is any indi- 
cation that other variables, when corrected 
for unreliability, might prove more valid than 
the Verbal test. If, for example, the Essay
had a higher validity than the Verbal when cor- 
rected for attenuation, then it might be hypoth- 
esized that an Essay test measures a compon- 
ent of writing ability other than the verbal fac- 
tor and that if made sufficiently long would 
prove a more satisfactory measure of writing 
ability. 

From the data at hand, it seems reasonable 
TABLE XIX

MULTIPLE CORRELATIONS FOR ALL PREDICTOR VARIABLES WITH CRITERIA
(N = 420)

                                  Multiple Correlation with
                              Instructors’      Average English
Predictors                    Ratings           Grade

Analysis I                       .68               .50
  with beta-weights:
  ( 1) Objective English         .18               .15
  ( 2) Essay - Content          -.02               .04
  ( 3) Essay - Style                               .04
  ( 4) Paragraph A              -.03              -.02
  ( 5) Paragraph B               .05               .13
  (11) Verbal Test               .50               .30

Analysis II                      .75               .75
  with beta-weights:
  ( 1) Objective English                           .19
  ( 2) Essay - Content          -.03               .02
  ( 3) Essay - Style             .13               .10
  ( 4) Paragraph A               .00               .02
  ( 5) Paragraph B               .09               .08
  (11) Verbal Test               .57               .55

Analysis III                     .79               .80
  with beta-weights:
  ( 1) Objective English         .16               .18
  ( 2) Essay - Content          -.03
  ( 3) Essay - Style             .13               .10
  ( 4) Paragraph A               .00
  ( 5) Paragraph B               .09               .08
  (11) Verbal Test               .60               .58
TABLE XX

MULTIPLE CORRELATIONS WITH CRITERIA FOR PREDICTOR VARIABLES
EXCLUDING VERBAL TEST
(N = 763)

                                  Multiple Correlation with
                              Instructors’      Average English
Predictors                    Ratings           Grade

Analysis I                       .53               .51
  with beta-weights:
  (1) Objective English
  (2) Essay - Content
  (3) Essay - Style
  (4) Paragraph A
  (5) Paragraph B

Analysis II                      .59               .62
  with beta-weights:
  (1) Objective English          .42
  (2) Essay - Content            .07
  (3) Essay - Style              .18
  (4) Paragraph A                .06
  (5) Paragraph B                .14

Analysis III                     .63               .64
  with beta-weights:
  (1) Objective English          .43
  (2) Essay - Content            .07
  (3) Essay - Style              .18
  (4) Paragraph A                .06
  (5) Paragraph B                .15
to estimate the reliability of the criteria at .82,
the correlation of each with the other. Then
the correlations between the following pairs of
variables, when each member of a pair is infinitely
long, are:

Essay vs. Average English Grade                 .71
Essay vs. Instructors’ Ratings                  .69
Objective English vs. Average English Grade     .75
Objective English vs. Instructors’ Ratings      .73


Thus, no evidence appears that either of these 
English tests is as closely related to writing 
ability as is the Verbal test which attained
uncorrected validities of .76 and .77.
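The corrections for attenuation reported in this section can be retraced with the standard formula, in which an observed correlation is divided by the square root of the product of the two reliabilities. The sketch below uses the coefficients quoted in the text; the pairing of each validity with a particular criterion (ratings first, grades second) is an assumption, and the names are illustrative.

```python
from math import sqrt

def correct_for_attenuation(r_xy, r_xx, r_yy):
    """Estimated correlation between two measures if both were perfectly reliable."""
    return r_xy / sqrt(r_xx * r_yy)

ESSAY_REL = 0.41        # estimated reliability of one essay question
OBJECTIVE_REL = 0.78    # odd-even reliability of Objective English
VERBAL_REL = 0.96       # odd-even reliability of the Verbal test
CRITERION_REL = 0.82    # correlation between the two criteria

corrected = {
    "Essay vs. Verbal": correct_for_attenuation(0.39, ESSAY_REL, VERBAL_REL),
    "Essay vs. Average English Grade": correct_for_attenuation(0.41, ESSAY_REL, CRITERION_REL),
    "Essay vs. Instructors' Ratings": correct_for_attenuation(0.40, ESSAY_REL, CRITERION_REL),
    "Objective English vs. Average English Grade": correct_for_attenuation(0.60, OBJECTIVE_REL, CRITERION_REL),
    "Objective English vs. Instructors' Ratings": correct_for_attenuation(0.58, OBJECTIVE_REL, CRITERION_REL),
}
for pair, value in corrected.items():
    print(f"{pair}: {value:.2f}")
```

Even after correction, none of the English-test values reaches the uncorrected Verbal validities of .76 and .77, which is the comparison on which the argument rests.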

(5) Components of Writing Ability: Tables 
XIX and XX are presented to demonstrate the 
relative extent to which each of the test vari- 
ables contributes to a measurement of writing 
ability. Multiple correlations using all variables 
are shown in Table XIX, and Table XX gives the 
multiples obtained when the Verbal test is elim- 
inated. 

Examination of the beta-weights in Table
XIX indicates that the Verbal test is the chief
determiner of the multiples with weights of .60
and .58, while the other variables are of small
or no significance: Objective English has weights
of .16 and .18; Essay - Style has weights of .13
and .10; Paragraph B has weights of .09 and .08;
and the weights of Essay - Content and Paragraph
A are approximately zero. The rise in
weights for the English scores when Verbal is
eliminated indicates that they are overlapping
the function of the Verbal test. That the English
scores are less closely related to the criteria
than is the Verbal score is shown by the lower
multiples (.63 and .64) in Table XX than in
Table XIX (.79 and .80).

While the sizes of some of the beta-weights 
in Table XIX might appear to indicate that cer- 
tain English variables are measuring a separate 
language factor, this is not borne out by a com- 
parison of the multiple correlations of .79 and 
.80 with the zero-order validity coefficients of 
.76 and .77 attained by the Verbal test alone. 
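The relation between beta-weights and a multiple correlation can be shown with a reduced two-predictor case. This is only an illustration: it uses the reported Verbal and Objective English validities (.76 and .58) and their intercorrelation (.61), and ignores the other four predictors, so its output is not expected to match Table XIX. The function name is hypothetical.

```python
import numpy as np

def multiple_correlation(predictor_corr, validities):
    """Standardized regression weights and multiple R from correlations alone."""
    R = np.asarray(predictor_corr, dtype=float)   # predictor intercorrelations
    r = np.asarray(validities, dtype=float)       # predictor-criterion correlations
    betas = np.linalg.solve(R, r)                 # beta = R^{-1} r
    multiple_r = float(np.sqrt(betas @ r))        # R^2 = beta' r
    return betas, multiple_r

# Predictors in order: Verbal, Objective English.
betas, multiple_r = multiple_correlation([[1.00, 0.61],
                                          [0.61, 1.00]],
                                         [0.76, 0.58])
print("beta-weights:", np.round(betas, 2))
print("multiple R:", round(multiple_r, 3))
```

In this reduced case the second predictor carries a visible weight yet raises the multiple only slightly above the Verbal validity alone, the same pattern the text draws from the full tables.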


(d) Conclusions 


1. The reliabilities of the Objective English
section (.78) and of the Verbal test (.96) are
satisfactory. From the viewpoint of reliability,
an entire hour devoted to Objective English
would be highly satisfactory, yielding a coefficient
of .91 for an unspeeded test.

2. The reader reliability of the Essay ques- 
tion was .62. This is unsatisfactory, particu- 
larly in view of the fact that the true reliability 
must necessarily be lower. 
3. The Paragraph-Revision section showed
high reader reliabilities, .84 and .81 for Paragraph
A and .69 and .72 for Paragraph B.

4. Paragraph-Revision showed low relation- 
ships with other variables; it is uncertain what 
is being measured by this type of material. 

5. The verbal factor is the only identifiable 
factor measured by the Objective English sec- 
tion and the Essay section. 

6. Instructors’ ratings and course grades 
are highly correlated (.82), indicating that stable
criteria were attained. Intercorrelations with 
other variables yield no evidence to support the 
indication in Study I that the two criteria are 
measuring different traits. 

7. The Verbal test is more closely related 
to writing ability as defined in this study
(correlations of .76 and .77) than is any other
variable. The other variables, when combined in
a multiple-regression equation with the Verbal 
test, fail to add appreciably to the relationship 
to writing ability demonstrated by the Verbal 
test alone. There is no support for the effect- 
iveness of the English variables in measuring 
any language ability other than the verbal fac- 
tor. 


F. General Conclusions 





The investigation points to the conclusion that 
in the light of present knowledge, measurable 
‘‘ability to write’’ is no more than verbal ability.
It has been impossible to demonstrate by
the techniques of this study that essay questions, 
objective questions, or paragraph-revision ex- 
ercises contain any factor other than verbal; 
furthermore, these types of questions measure 
writing ability less well than does a typical 
verbal test. The high degree of success of the 
verbal test is, however, a significant outcome. 

The results are discouraging to those who 
would like to develop reliable and valid essay 
examinations in English composition—a hope 
that is now more than half a century old. Im- 
provement in such essay tests has been possible 
up to a certain point, but professional workers 
have long since reached what appears to be a 
stone wall blocking future progress. New basic 
knowledge of human capacities will have to be 
unearthed before better tests can be made or 
more satisfactory criteria developed. To this 
end the Educational Testing Service has pro- 
posed, pending availability of appropriate funds, 
a comprehensive factor study in which many 
types of exercises both new and traditional are 
combined with tests of many established factors 
in an attempt to discover the fundamental nature 
of writing ability. The present writer would 
like to endorse such a study as the only auspic- 
ious means of adding to our knowledge in this 
field. Even then, it appears unlikely that sig- 
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nificant progress can be made without further 
explorations in the area of personality meas- 
urement. 
APPENDIX I
SAMPLE QUESTIONS IN OBJECTIVE ENGLISH 


The following items have been released to illustrate the objective English material used in the inves- 
tigation. Because of the security regulations governing the confidential tests of the College Entrance 
Examination Board, it is not possible to divulge all of the items. An answer key is provided, together 
with the classifications of the items. 


Directions: This section tests your ability to write formal English effectively as well as correctly. In
each of the sentences in this section, certain portions are underlined and numbered. On the right-hand 
side of the page are suggested several ways of writing or punctuating each underlined portion, or several 
positions in which to place it in the sentence. Choose the answer which is correct, or which sounds best, 
and blacken the space beneath the corresponding number on the appropriate line on the answer page. The 
answer page which you will use for the following questions is the front cover of your blue answer booklet. 
NO CREDIT WILL BE ALLOWED FOR ANYTHING WRITTEN IN THIS TEST BOOKLET OR FOR MULTI- 
PLE ANSWERS. 

In some cases ‘‘OMIT’’ is given as a possible answer; choice of this answer means that you think it
would be better to eliminate the underlined portion entirely than to take any of the other alternatives given. 


* * * * *


I can scarcely be enthusiastic about his skill as [1] a violinist, but I must admit that he is
somewhat [2] better than many musicians who are more popular.

     1. (1) as (2) of being (3) as regards being (4) for
     2. (1) somewhat (2) some

* * * * *

The success of any experiment is jeopardized by inaccuracy one [3] error may invalidate all the
conclusions found [4].

     3. (1) inaccuracy one (2) inaccuracy, one (3) inaccuracy; one
     4. (1) found (2) which are found (3) OMIT

* * * * *

About 1890 Arthur Balfour suggested, in [5] his genially devastating way [6] that human Pe is
not founded on aout, which progresses [7] but on feeling and instinct, [8] which remain sotiet
unchanged.

     5. (1) , in (2) in
     6. (1) way (2) way,
     7. (1) progresses (2) progresses,
     8. (1) instinct, (2) instinct

* * * * *

On July 4, 1776, the American colonies declared themselves completely independent from [9]
British rule.

     9. (1) from (2) of

* * * * *

We could not obtain the book any place [10] in town.

    10. (1) any place (2) nowhere (3) anywhere (4) anywheres
No one was much taken by the applicant’s personality, though [11] his credentials indicated that
he was pleasant, intelligent, and friendly.

    11. (1) , though (2) : though (3) ; though (4) . Though

* * * * *

The issue is one for we [12] students to decide; it ought not to be left to the student council alone.

    12. (1) we (2) us

* * * * *

Having decided to install the new machines, we discovered after several months that we would [13]
have to wait at least a year before they were [14] obtainable.

    13. (1) would (2) will
    14. (1) were (2) are (3) will be (4) would be

* * * * *

The [15] atomic theory is not new, but [16] it is more useful in its present form than it was when
first conceived, and [17] the modern theory was the product of many men’s work, but [18] only a
few modern scientists deserve the credit for its present usefulness.

    15. (1) The (2) Although the (3) In spite of the fact that the
    16. (1) but (2) yet (3) OMIT
    17. (1) and (2) and although (3) furthermore
    18. (1) but (2) yet again (3) and (4) OMIT

* * * * *

Shakespeare is the most universally loved of all poets [19].

    19. (1) of all poets (2) of any poet (3) of any poets (4) of any other poets

* * * * *

The organization of the United Nations is more complex than [20] the League of Nations.

    20. (1) than (2) than was (3) than that of (4) than those of

* * * * *

Our staff reported yesterday that they have not yet deciphered [21] the message.

    21. (1) have not yet deciphered (2) had not yet deciphered (3) did not yet decipher

* * * * *

Some day we must undertake such a project, why [22] not now?

    22. (1) , why (2) ,—why (3) . Why

* * * * *

Seton is trying to formulate a general law which would hold for [23] all civilizations.

    23. (1) for (2) in the case of (3) regarding (4) in regard to


Answer Key and Item Classifications 





 1. 1 idiomatic expression
 2. 1 idiomatic expression
 3. 3 sentence structure
 4. 3 sentence structure
 5. 1 punctuation
 6. 2 punctuation
 7. 2 punctuation
 8. 1 punctuation
 9. 2 idiomatic expression
10. 3 idiomatic expression
11. 1 sentence structure
12. 2 grammar
13. 1 grammar
14. 4 grammar
15. 2 sentence structure
16. 3 sentence structure
17. 2 sentence structure
18. 4 sentence structure
19. 1 grammar
20. 3 grammar
21. 2 grammar
22. 3 punctuation
23. 1 idiomatic expression





March, 1954) HUDDLESTON 


APPENDIX II 
SAMPLE VERBAL ITEMS 


The following samples from the C. E. E. B. Bulletin of Information illustrate the variety of verbal ma-
terial included in the Verbal test in Study II. An answer key is provided.


* * * * *


Directions: Each question in this subtest consists of a group of four words, two of which are approxi- 
mately opposite to each other in meaning. Decide which two words in each group are most nearly oppos- 
ite, and blacken the space beneath the corresponding pair of numbers on the answer sheet; i.e., mark the 
space between the dotted lines beneath ‘‘1-2’’ if words numbered 1 and 2 are opposite, beneath ‘‘2-4’’ if 
words 2 and 4 are opposite, beneath ‘‘3-4’’ if words 3 and 4 are opposite, etc. Mark only ONE set of dot- 
ted lines for each question, and be sure all your marks are heavy and black. 


1. 1-essential 2-classic 3-superfluous 4-disarming
2. 1-qualified 2-unfit 3-healthful 4-primitive
3. 1-fleeting 2-impenetrable 3-permeable 4-perjured
4. 1-circumscribed 2-tedious 3-senile 4-interesting
5. 1-unwitting 2-serious 3-deliberate 4-mollified
6. 1-authentic 2-mechanical 3-spurious 4-productive
7. 1-dispassionate 2-illustrious 3-impecunious 4-affluent
8. 1-resilient 2-perspicacious 3-salient 4-inconspicuous


* * * * *


Directions: Each of the questions in this subtest consists of two words which have a certain relationship
to each other, followed by five numbered pairs of related words. Select the numbered pair of words which 
are related to each other in the same way as the original pair of words are related to each other. Then, 
on the answer sheet, blacken the space beneath the number corresponding to the number of the pair you 
have selected. 


 9. OINTMENT: BURN:: 1-tears: consolation 2-consolation: grief
3-butter: bread 4-bread: meat 5-happiness: grief

10. EROSION: ROCKS:: 1-flatness: landscape 2-fatigue: task
3-fasting: food 4-dissipation: character 5-forgery: signature

11. FIBER: FABRIC:: 1-average: aggregate 2-nucleus: cell
3-obstinacy: deadlock 4-appurtenance: object 5-member: league

12. REST: FATIGUE:: 1-diploma: graduate 2-laziness: obesity
3-pinnacle: mountain 4-relaxation: recreation 5-praise: dejection

13. SKELETON: BODY:: 1-prisoner: cell 2-law: society 3-prisoner:
law 4-jury: sentence 5-law: jury


* * * * *


Directions: In each of the sentences in this subtest there is a blank space, indicating that a word has 
been omitted. Beneath the sentence are five numbered words; from these five words you are to choose 
the one word which, when inserted in the blank space, best fits in with the meaning of the sentence as a 
whole. 


14. One of the most prevalent erroneous contentions is that
Argentina is a country of .......... agricultural resources
and needs only the arrival of ambitious settlers.


1-modernized 2-flourishing 3-undeveloped 4-waning
5-limited


15. The last official statistics for the town indicated the presence
of 24,212 Italians, 6450 Magyars, and 2315 Germans, which
ensures to the .......... a numerical preponderance.

1-Germans 2-figures 3-town 4-Magyars 5-Italians


16. Precision of wording is necessary in good writing; by choosing
words that exactly convey the desired meaning, one can
avoid ..........


1-duplicity 2-incongruity 3-complexity 4-ambiguity 
5-implications 


17. Various civilians of the liberal school in the British Parliament
remonstrated that there were no grounds for .......... of
French aggression, since the Emperor showed less disposition
to augment the navy than had Louis Philippe.


1-suppression 2-retaliation 3-apprehension 4-concealment
5-commencement


* * * * *


Directions: In this subtest each passage is followed by questions based upon its content. Each question 
consists of an incomplete statement followed by five suggested completions, only one of which is correct. 
After reading a passage, answer each of the questions following it by choosing the correct completions 
and blackening the space beneath the corresponding number on the answer sheet. 

The questions following a passage are to be answered on the basis of what is stated or implied in that 


passage. 





To regard peace as something which Europe might have and has 
wantonly chosen not to have is naively to ignore every important 
aspect of Europe’s development. The work Europe has done in 


building up civilization and culture was bound to be accompanied by 
strong passions, because only strong passions can supply the dyn- 
amic force which enables man to proceed to these activities in spite 
of the inertia which induces him to leave things as they are. The 
fact that this development has not produced peace is not to be at- 
tributed to a lack of trying. In this world, where it is hard enough 

to understand those within arm’s length, it is nearly impossible to 
understand those who are beyond our sight or those who are not 
explained to us by ties of birth. For England to have understood 

the American revolutionaries or for the American revolutionaries 

to have understood England would have required more imagination 
and insight on the part of both populations than it would be possible 
to expect. International relationships are destined to be clumsy 
gestures based on imperfect knowledge. Since peace is determined 
by international relationships, it follows that until man has succeeded 
in improving his understanding, frequent war is preordained. If we 
say that Europe is a failure because out of these battles against the 
immense forces of human fallibility there comes war, then we are not 
being fair. We had better commend Europe for the effort it has made 
and not deprive it of the just reason for pride it may have in the pro- 
gress it has managed to accomplish. 


18. The temptation to ‘‘leave things as they are’’ 

1-prevents any possibility of peace in the near future 

2-was a guiding principle for England during the American 
Revolution 

3-is one of the reasons for the present state of international 
confusion 

4-explains why Europe has been so slow in its development 

5-is an obstacle that has to be overcome in the development 
of civilization 
19. The author believes that those who call Europe a failure

1-are in effect condemning man himself for his frailties

2-do not realize that frequent war is inevitable

3-are being realistic about the state of international rela-
tionships

4-are ignoring the position of the smaller nations in the 
general scheme 

5-are justified in believing that the European peoples 
should exert a greater effort to establish peace 


20. The author believes that Europe should not be wholly con-
demned for its constant wars because

1-civilization advances through war 

2-dynamic struggles are inevitable in any effort toward 
cultural development 

3-world peace can never be attained 

4-the individual nations in Europe have been unwilling 
to compromise 

5-international understanding hinges on more than phys- 
ical proximity 


21. The author feels that the progress Europe ‘‘has managed to
accomplish’’ has been made in 
1-trying to overcome barriers of distance 
2-defeating the forces of human fallibility 
3-laying the foundation for world-wide understanding 
4-developing its civilization 
5-establishing a workable basis for national peace 


* * * * *


Answer Key 


AN EXPERIMENTAL EVALUATION OF THE
EFFICACY OF TWO METHODS OF TEACH- 
ING MUSIC APPRECIATION 


MORTON J. KESTON 
University of New Mexico 


The Problem 


THE PURPOSE of this experiment was 
to provide an experimental basis for judging the 
relative superiority of two different points of 
view in the teaching of music appreciation. Some
music educators maintain that it is exposure to 
music alone which may produce the relevant 
response termed music appreciation; other mus- 
ic educators insist that in order to be effective, 
music must be presented together with careful- 
ly planned comments. The null hypothesis of 
this experiment is therefore: There is no sig- 
nificant difference between experimental and 
control classes in the development of music 
appreciation. The experimental group desig- 
nates those students who listened to music to- 
gether with comments. The control group des- 
ignates those students who listened to these 
same musical compositions without any com- 
ment other than the title of the music to be 
heard. 


The Design of the Experiment 





Eighty-nine sophomores, juniors, and sen- 
iors of the University High School of the Univer- 
sity of Minnesota volunteered to relinquish their 
study periods in order to join the experiment. 
In order to avoid bias in the formation of exper-
imental and control groups, the names of all
students in each class period were gathered in- 
to a single group, arranged alphabetically, and 
then subdivided alternately into two groups, one 
of which, randomly determined, became the 
experimental group, and the other one the con- 
trol group. In addition, a zero control group 
was set up. A near-by high school choir of 24 
students was tested for music preference in 
September at the beginning of the school year 
and again in May toward the end of the school 
year. The zero control group was used to pro- 
vide a basis for estimating what changes in 
musical preference may be expected from the 
environmental factors not controlled in the ex- 
perimental situation. 
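
The alternate-assignment procedure described above can be sketched in a few lines of Python. This is an illustration only: the student names, the function name `alternate_split`, and the seeded random draw that decides which half becomes the experimental group are assumptions, not details reported in the study.

```python
import random


def alternate_split(names, rng):
    """Alphabetize the names, deal them alternately into two groups,
    then let a random draw decide which group is experimental."""
    ordered = sorted(names)
    first, second = ordered[0::2], ordered[1::2]
    # Random determination of which half receives the experimental treatment.
    if rng.random() < 0.5:
        first, second = second, first
    return {"experimental": first, "control": second}


# Hypothetical class list; the seed is fixed only to make the sketch repeatable.
rng = random.Random(0)
groups = alternate_split(["Olson", "Berg", "Dahl", "Nelson", "Lund"], rng)
print(groups)
```

Alternate assignment from an alphabetical list is systematic rather than fully random; only the labeling of the two halves is left to chance in this sketch.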





The over-all procedure was to determine the 
music preferences and other related measures 
of these students at the beginning of the school 
year, subject the several groups to the differ- 
ential treatment for an entire school year, and 
then re-measure the music preferences and 
other variables to note significant differences, 
if any, as a result of the differential treatment. 

The activity of the control groups consisted 
only of listening to records, the titles of which 
were announced before the records were played. 
The experimental groups, however, not only 
heard these records, but, in addition, were 
subjected to lecture material designed to arouse 
interest in the music to be heard. Approximate- 
ly half of the class time of the experimental 
group was devoted to listening to music, and 
the other half was given over to a discussion of 
the music. The organization of recorded ma-
terial presented to both control and experiment-
al groups was chronological, beginning with the
works of Bach and his predecessors and ending 
with the outstanding compositions of the twen- 
tieth century. An extensive testing program 
took place at the beginning of the school year 
and again at the end of the school year.

The following tests were administered to the 
experimental and control groups: 


1. Oregon Music Discrimination Test. This test 
is administered by means of phonograph record- 
ings. The student is required to judge which 
of a pair of piano selections he prefers. One 

of the selections is taken from the original 
composition; the other member of the pair is a 
distortion of the original composition. There 
are forty-eight such pairs. This test, presum- 
ably a music discrimination test, was given 
both at the beginning and at the end of the year. 





2. Seashore Measures of Musical Talents. This 
test is also administered by means of phono- 
graph recordings. Pitch, Rhythm, and Tonal 
Memory, three of the six Seashore measures 
of this battery, were used. Loudness, Time, 
and Timbre, the other three measures of the 
battery, were eliminated from the study be-
cause these qualities seemed remote from the
factor under investigation, the development of 
musical taste. 


3. Kwalwasser-Ruch Test of Musical Accom- 
plishment. This is a paper-and-pencil test 
and measures musical knowledge of such factors 
as musical symbols, time signatures, note val- 
ues and the like. 





4. Kwalwasser Test of Musical Information and 
Appreciation. This is also a pencil-and-paper 
test and differs from the preceding one in that 
it measures knowledge of compositions, com-
posers, and orchestral instruments.





5. Otis Quick Scoring Mental Ability Test. This 
test was administered in order to obtain a meas- 
ure of the intelligence of each student in the 
study. 





6. Keston Music Preference Test. Because it 
was suspected that the Oregon Music Discrim- 
ination Test would not adequately measure the 
crucial variable in this experiment, and because 
no other test which could measure the musical 
discrimination necessary for this study was 
available, such a test was designed. The Music 
Preference Test when completed consisted of 
thirty groupings or items of four selections 
each, recorded on fifteen double-faced acetate 
record discs. Each musical selection lasted 
forty-five seconds, so that each record side 
contained approximately three minutes of 
music. Each test item included four categories 
of music: (1) Category A, serious classical; 
(2) Category B, popular classical or ‘‘pop con- 
cert’’ music; (3) Category C, dinner music; and 
(4) Category D, ‘‘swing’’. As there are n!, i.e.,
4 x 3 x 2 or 24, possible arrangements of the
four music selections, the first twenty-four 
groupings each had a different order of present- 
ation. The order of each of the remaining six 
items was chosen at random from these first 
twenty-four arrangements. The categories A, 
B, C, and D were chosen in the expectation that 
serious musicians who were to be selected as the
expert group would indicate a rank preference 
for the selections in each item in the category 
order A, B, C, D. The average student, how- 
ever, not discriminating as to musical taste, 
would depart widely from the experts in his 
choices. These expectations were borne out in 
the administration of the test to various groups 
as illustrated in Table I.
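
The ordering scheme for the thirty test items can be illustrated with a short Python sketch. It is hedged in two respects: the text does not say in what sequence the twenty-four distinct orders were used, nor whether the six remaining orders were drawn with or without repetition, so the lexicographic sequence and the draw without repetition below are assumptions, as is the function name `presentation_orders`.

```python
import itertools
import random

CATEGORIES = ("A", "B", "C", "D")  # serious classical, popular classical, dinner, swing


def presentation_orders(n_items=30, seed=0):
    """Return one presentation order per test item.

    The first 24 items use each of the 4! = 24 arrangements once;
    the remaining items reuse arrangements chosen at random from those 24.
    """
    all_orders = list(itertools.permutations(CATEGORIES))
    rng = random.Random(seed)
    # Assumption: the extra orders are drawn without repetition.
    extra = rng.sample(all_orders, n_items - len(all_orders))
    return all_orders + extra


orders = presentation_orders()
print(len(orders), "items;", len(set(orders[:24])), "distinct orders among the first 24")
```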

The scoring of the test proved to be a source
of considerable difficulty, for weightings had 
to be determined in such a way that a subject 
would receive some credit for judgments which 
were approximately ‘‘correct’’. A method of 
difference was finally adopted which measured 
the degree of departure of the subject from the 
judgments of the music authorities. 1 Therefore 
the higher the score of a given individual, the 
greater his departure from the opinion of the 
experts, and, consequently, the lower his rat- 
ing on the Music Preference Test. Scores 
range from a theoretical perfect score of 0 to 
a theoretical maximum score of 159.6. It may 
be noted that even the expert group had an av-
erage score of 25. This is because these ex-
perts reflect minor differences among them-
selves in musical taste, and they departed, on
the average, 25 points from a system of class- 
ification used in the construction of the test. 
Several of them, however, came within 15 points 
of the theoretically perfect score. 

An examination of Table I demonstrates the 
validity of the scale. Operationally, therefore, 
with the use of this test, it was possible to de- 
fine musical judgment quantitatively according 
to the following simple principle: to the extent 
individuals agreed in their musical judgments 
with music authorities or advanced music stu- 
dents, their musical judgments were superior; 
to the extent these students disagreed with the 
experts, their musical judgments were inferior. 
The consistently low scores of the music groups 
as contrasted with the high scores of the non- 
music groups established the validity of the test. 

The reliability of the test was determined by 
an analysis of test-retest data according to an 
application of the analysis of variance technique of
Jackson and Ferguson (3). The results of this 
analysis indicated the test to be a sensitive one 
with an over-all reliability of .95.

In addition, an investigation of the consis- 
tency of the subjects’ ranking was made by an 
analysis of the incidence of circular triads. In 
order to do this, the musical excerpts of each 
item were presented to the subjects in the form 
of paired comparisons. Thus, if a subject pre- 
ferred excerpt A to excerpt B, excerpt B to ex- 
cerpt C, and then indicated a preference for ex- 
cerpt C over excerpt A, an inconsistency or 
circular triad would be present. Of 1302 items 
analyzed, 151 or 11.5 percent were found to 
contain a circular triad. The Music Prefer- 
ence Test used in this study was therefore 





1. The method of grading the Music Preference Test as well as the determination of the validity
and reliability of the scale will be described in full in a forthcoming Psychological Mono-
graph, "The Development of a Test of Musical Preference."
TABLE I

MEANS AND RANGES OF MUSIC PREFERENCE TEST SCORES OF EXPERT GROUPS
AND OF JUNIOR AND SENIOR HIGH SCHOOL STUDENTS

Group

Music Authorities
Music Students (Sophomores, University of Minnesota)
Music Students (Seniors, University of Minnesota)
Music Students (Graduates, University of Minnesota)
Senior High School Students (Junior High School, University of Minnesota)
Junior High School Students (University High School, University of Minnesota)



TABLE II

MEANS AND RANGES OF MUSIC RECOGNITION TEST
SCORES OF SEVERAL GROUPS

Group                                     Number     Mean

Students of Study                           89       6.93
University High School (Junior High)        58       2.29
Marshall High School (Senior Choir)         24       6.04
Marshall High School (Junior High)          36
found to be internally consistent as well as reli-
able, for approximately only one item out of
every ten items manifested a circular triad. A 
chi-square test revealed no significant differ- 
ence between the experimental groups and con- 
trol groups in the proportion of circular triads 
found. 2 
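
A circular triad of the kind just described can be detected mechanically from the paired-comparison record of one item. The following Python sketch is illustrative only: the representation of judgments as a set of (preferred, rejected) pairs and the function name `count_circular_triads` are assumptions, and the two sample records are invented.

```python
from itertools import combinations


def count_circular_triads(excerpts, wins):
    """Count circular triads in one item's paired comparisons.

    `wins` is a set of (preferred, rejected) pairs, one for every pair of excerpts.
    """
    count = 0
    for triple in combinations(excerpts, 3):
        # In a consistent triple one excerpt is preferred to both others;
        # in a circular triple every excerpt is preferred exactly once.
        victories = [sum((x, y) in wins for y in triple if y != x) for x in triple]
        if victories == [1, 1, 1]:
            count += 1
    return count


excerpts = ["A", "B", "C", "D"]

# Invented consistent record: A over B over C over D.
consistent = {("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"), ("C", "D")}

# Invented inconsistent record: A over B, B over C, but C over A.
circular = {("A", "B"), ("B", "C"), ("C", "A"), ("A", "D"), ("B", "D"), ("C", "D")}

print(count_circular_triads(excerpts, consistent))
print(count_circular_triads(excerpts, circular))
```

An item would be counted as containing a circular triad whenever this count is greater than zero.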

The Music Preference Test was administer- 
ed to the experimental groups, control groups, 
and zero control group at the beginning of the 
school year and again at the end of the school 
year in order to note any possible shift in musi- 
cal preference as a result of the differential 
treatment. A significant shift was found, and 
this paper attempts an analysis of this shift and 
the part played by diverse factors in the individ- 
uals who took part in the experiment. 


7. Keston Music Recognition Test. The ability 
to identify musical compositions is generally 
granted little importance in the development of 
musical taste, for the elements of musical taste 
are musical judgments, not musical facts. How- 
ever, it was considered of interest in this study 
to collect some data on the ability to recognize 
music in order to note its relationship or lack 
of relationship to the development of musical 
discrimination. 

Again, however, no test was available, and 
one had to be constructed. The best known com-
positions of each of 30 composers were selected
and re-recorded on acetate discs. In the admin- 
istration of this test, the subject listens to ex- 
cerpts which last forty-five seconds and indi- 
cates by number which of 34 listed names cor- 
responds to the composer of each excerpt. One 
point is allowed for each correct response; the 
minimum score is therefore 0, and the maxi-
mum score is 30. The simplicity and objectiv-
ity of the test assured its validity. The reliabil-
ity of the test was determined according to the
technique used for the Music Preference Test. 
The Music Recognition Test was found to be a 
sensitive one with an over-all reliability of .93. 
Table II includes the means and ranges of sev- 
eral of the groups tested. 

In addition, information regarding the grade- 
point average and socio-economic status of the 
students was gathered from the office records. 





The Statistical Analysis of the Music Prefer- 
ence Test Scores 








1. Test for pooling: The data of this exper- 
iment were collected from three experimental 
groups and three control groups. Analysis of 
variance applied to the scores of the final Music 
Preference Test demonstrated that it was sta-
tistically permissible to pool these groups into
a single experimental group and a single control 
group, i.e., the groups to be pooled were found 
to be homogeneous with respect to final Music 
Preference Test scores. 

2. Test for normality: A basic assumption 
of the analyses was normality of the data. The 
data of the pooled experimental group and the 
pooled control group were subjected to the pro- 
bit test of normality. Each Music Preference 
Test score was assigned a percentage in the distribution by the use of the formula

$$\frac{(2n-1)\,100}{2N}$$

where n is the rank of a given Music Preference
Test score in the group, and N is the total num-
ber of scores. These percentages were con-
verted to probits by referring to the appropriate
table (Table IX in Fisher and Yates, ref. 1).
When the final Music Preference Test scores 
were plotted against the probit values for each 
score, the resulting points approximated a 
straight line in both the experimental group and 
the control group. This is the criterion of the 
probit test of normality, the formation of a 
straight line when the scores are plotted as ord- 
inates and the probits as abscissae. 
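
The probit check described above can be sketched in Python. The percentage formula is the one given in the text; the conversion to probits uses the classical definition (normal deviate plus 5) in place of a printed table. The function names, the use of a correlation coefficient as a numerical stand-in for judging straightness by eye, and the synthetic scores are assumptions made for illustration, and tied scores are simply ranked in sorted order.

```python
from math import sqrt
from statistics import NormalDist


def probit_points(scores):
    """Return (probit, score) pairs, with percentage = (2n - 1)100 / 2N for rank n."""
    total = len(scores)
    points = []
    for rank, score in enumerate(sorted(scores), start=1):
        percentage = (2 * rank - 1) * 100 / (2 * total)
        # Classical probit: the normal deviate for this percentage, plus 5.
        probit = NormalDist().inv_cdf(percentage / 100) + 5
        points.append((probit, score))
    return points


def pearson_r(pairs):
    """Correlation between abscissae and ordinates; near 1 for a straight line."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / sqrt(sxx * syy)


# Synthetic scores placed exactly at normal quantiles (mean 60, s.d. 12),
# so the plotted points fall on a straight line by construction.
synthetic_scores = [NormalDist(60, 12).inv_cdf((2 * r - 1) / 40) for r in range(1, 21)]
print(round(pearson_r(probit_points(synthetic_scores)), 4))
```

With real scores the points will scatter about the line; how close to a straight line is close enough remains a matter of judgment in this sketch, as it is in a graphical check.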

3. The analysis of variance and covariance: 
The analysis of variance and covariance is par- 
ticularly applicable in the analysis of these data, 
for the fundamental question to be answered is:
Have there been any significant changes in the 
final scores of the two groups when the original 
scores have been taken into account? An anal- 
ysis of the initial Music Preference Test scores 
indicated that the experimental group and the 
control group were significantly different with 
respect to music preference ratings, i.e., the 
groups were unmatched. However, analysis of 
variance and covariance renders obsolete the 
necessity for rigidly matched groups, because 
by means of this statistical technique, the ef-
fect of the inequalities between original music
preference ratings of the experimental and con- 
trol groups may be eliminated as a variable in 
the experiment. This is accomplished by ad- 
justing the sum of squares of the final Music 
Preference Test scores, so that the effect of 
the inequalities in the initial Music Preference 
Test scores of the two groups on the final scores
is removed or eliminated statistically. Simil- 
arly, the effects of inequalities of other factors 
such as intelligence, socio-economic status, 
and the like on the final Music Preference Test 
scores may be eliminated by appropriate ad-
justments. These adjustments on the final
sums of squares of the variable under consid-
eration come about through the analysis of the
sum of the cross products of the variable under
consideration and the other variable whose ef-
fect is to be eliminated.

2. An article devoted exclusively to the problem of the circular triads found in this study will
be published in the near future.

In the first analysis of variance and covari- 
ance to be performed, the null hypothesis is: 
there is no significant difference in the means 
of the final Music Preference Test scores of the 
experimental group and the control group when 
the effect of the inequalities of the initial Music 
Preference Test scores of the two groups is 
eliminated. Table III includes the data neces-
sary for the analysis. The initial test scores
are referred to as X; the final test scores are
referred to as Y. 
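
The adjustment of the final scores (Y) for the initial scores (X) can be sketched as a one-covariate analysis of covariance in Python. This is the standard textbook computation, given here as an illustration under assumptions: the function names and the sample scores are invented, and the sketch does not reproduce the study's figures.

```python
def deviation_sums(x, y):
    """Sums of squares and cross products about the means of x and y."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxx, sxy, syy


def ancova_f(groups):
    """F ratio for group differences in Y after adjusting for X.

    `groups` is a list of (x_scores, y_scores) pairs, one pair per group.
    """
    all_x = [v for x, _ in groups for v in x]
    all_y = [v for _, y in groups for v in y]
    t_xx, t_xy, t_yy = deviation_sums(all_x, all_y)  # total
    w_xx = w_xy = w_yy = 0.0  # within groups
    for x, y in groups:
        sxx, sxy, syy = deviation_sums(x, y)
        w_xx, w_xy, w_yy = w_xx + sxx, w_xy + sxy, w_yy + syy
    # Adjusted sum of squares: sum y^2 - (sum xy)^2 / sum x^2.
    adj_total = t_yy - t_xy ** 2 / t_xx
    adj_within = w_yy - w_xy ** 2 / w_xx
    adj_between = adj_total - adj_within
    df_between = len(groups) - 1
    df_within = len(all_y) - len(groups) - 1  # one d.f. lost to the regression
    f_ratio = (adj_between / df_between) / (adj_within / df_within)
    return f_ratio, df_between, df_within


# Invented initial (X) and final (Y) scores for two small groups.
experimental = ([70, 80, 90, 100, 110], [60, 68, 82, 88, 102])
control = ([60, 75, 85, 95, 105], [66, 80, 88, 101, 108])
print(ancova_f([experimental, control]))
```

The resulting F would be referred to a table with the returned degrees of freedom; the sketch presupposes the homogeneity conditions that the text goes on to test.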

A basic assumption in the analysis of covar- 
iance is that the regression coefficients within 
groups be homogeneous. Where three or more
groups are involved, an appropriate test is the 
Welch-Nayer L1 test. In this case, however,
two groups are involved, and an appropriate test
is a t test, a test of the hypothesis that two
within regression coefficients, b1 and b2, obtained
from two random samples of sizes N1 and N2,
are from the same population. The value of t,
.63, calculated from the data, was not signifi-
cant, and the condition of homogeneity of regres-
sion coefficients was satisfied. The assumption
of homogeneity of variances within groups was 
confirmed by the F-test. 

Table IV represents the analysis of variance 
and covariance indicating the adjusted or cor- 
rected sum of squares. The total sums of pro- 
ducts of deviations have been analyzed in order 
to eliminate or remove the effect of the initial 
Music Preference Test scores on the final Mus- 
ic Preference Test scores. 

The F value obtained in Table IV is beyond 
the tabled value for the 1 percent level of signif- 
icance, and the hypothesis may be rejected. In 
terms of the experiment performed, this indi- 
cates that the method of instruction in music ap- 
preciation received by the experimental group 
was superior to that received by the control 
group. 

It is relevant at this point to run a similar 
analysis of variance and covariance on the in- 
itial and final Oregon Music Discrimination Test 
scores to note whether the differential result 
obtained in the Music Preference Test was ob- 
tained for the Oregon Music Discrimination Test 
as well. Table V is the analysis of variance and
covariance table which tests the null hypothesis: 
there is no significant difference in the means 
of the final Oregon Music Discrimination Test 
scores of the experimental group and the con- 
trol group when the effect of the inequalities of 
scores on the initial Oregon Music Discrimin-
ation Test is eliminated.

The F value obtained is not significant at the
5 percent level, and the results of this analysis 
indicate no significant difference between the
adjusted means of the experimental group and 
control group. Inasmuch as a significant differ- 
ence was found in the analysis of the Music Pref- 
erence Test scores, evidently the Oregon Mus- 
ic Discrimination Test measures a different 
capacity than the Music Preference Test. This 
comparison of the two music discrimination 
tests indicates that the decision not to use the 
Oregon Music Discrimination Test as an im- 
portant testing device in this experiment was a 
correct one. 

It is possible to carry out the analysis of vari-
ance and covariance in such a way that the sum
of squares of the dependent variable about the 
mean may be adjusted or freed from the effects 
of several factors simultaneously. Theoretic- 
ally, any number of independent variables may 
be included in such an analysis, but the com- 
plexity of the process increases with the addi- 
tion of each independent variable. A practical 
limit for the number of independent variables 
to be included in such an analysis would be two. 
In this experiment data were collected on ten 
variables other than musical preference. An 
appropriate question at this point would be to 
ask what influence each of these factors had on 
the final Music Preference Test scores. In or- 
der to do this, the final Music Preference Test 
scores would have to be freed from the influence 
of the initial Music Preference Test scores and, 
at the same time, from each of the ten variables.

It would be pertinent, for example, to inquire
whether or not the factor of music recognition
was influential in determining the scores of the
final Music Preference Test. The null hypothesis
to be tested was: there is no significant
difference in the means of the final Music Preference
Test scores when these mean scores
have been adjusted for any inequalities in the 
two groups with respect to both factors, the in- 
itial Music Preference Test scores and the in- 
itial Music Recognition Test scores. The set 
of equations used in this analysis is: 


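\[ \sum y^2 = A;\quad \sum x^2 = B;\quad \sum z^2 = C;\quad \sum yx = D;\quad \sum yz = E;\quad \sum xz = F, \]

and

\[ M = \frac{CD^2 - FED}{BC - F^2}, \qquad N = \frac{BE^2 - FDE}{BC - F^2}. \]

The adjusted \(\sum y^2\), in which the dependent variable
is freed from the influence of both factors,
is then equal to \(A - M - N\).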
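The adjustment just described can be sketched in a few lines of Python. This is only an illustration: the six scores per variable are hypothetical, the function names are not from the study, and the sums A through F are taken as sums of squares and products of deviations from the means. The second function fits the two-covariate regression directly and sums the squared residuals, which should agree with A - M - N.

```python
def adjusted_sum_of_squares(y, x, z):
    """Return (A, M, N, A - M - N) for dependent y and covariates x and z."""
    n = len(y)
    my, mx, mz = sum(y) / n, sum(x) / n, sum(z) / n
    dy = [v - my for v in y]
    dx = [v - mx for v in x]
    dz = [v - mz for v in z]
    A = sum(v * v for v in dy)               # sum of y squared
    B = sum(v * v for v in dx)               # sum of x squared
    C = sum(v * v for v in dz)               # sum of z squared
    D = sum(p * q for p, q in zip(dy, dx))   # sum of yx
    E = sum(p * q for p, q in zip(dy, dz))   # sum of yz
    F = sum(p * q for p, q in zip(dx, dz))   # sum of xz
    det = B * C - F * F
    M = (C * D * D - F * E * D) / det
    N = (B * E * E - F * D * E) / det
    return A, M, N, A - M - N


def residual_sum_of_squares(y, x, z):
    """Independent check: regress y on x and z, then sum the squared residuals."""
    n = len(y)
    my, mx, mz = sum(y) / n, sum(x) / n, sum(z) / n
    dy = [v - my for v in y]
    dx = [v - mx for v in x]
    dz = [v - mz for v in z]
    B = sum(v * v for v in dx)
    C = sum(v * v for v in dz)
    D = sum(p * q for p, q in zip(dy, dx))
    E = sum(p * q for p, q in zip(dy, dz))
    F = sum(p * q for p, q in zip(dx, dz))
    det = B * C - F * F
    b_x = (C * D - F * E) / det   # partial regression coefficient on x
    b_z = (B * E - F * D) / det   # partial regression coefficient on z
    return sum((p - b_x * q - b_z * r) ** 2 for p, q, r in zip(dy, dx, dz))


# Hypothetical scores, for illustration only.
final_pref = [12, 15, 11, 18, 14, 16]
initial_pref = [10, 13, 9, 15, 12, 15]
initial_recog = [3, 5, 2, 6, 6, 4]

A, M, N, adjusted = adjusted_sum_of_squares(final_pref, initial_pref, initial_recog)
check = residual_sum_of_squares(final_pref, initial_pref, initial_recog)
print(round(adjusted, 6), round(check, 6))
```

The two printed values coincide because M + N is the sum of squares of y accounted for by the regression on x and z together.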

Table VI is the analysis of variance and covariance
table for the final Music Preference
Test scores with both initial Music Preference
scores and initial Recognition Test scores held
constant.

The F value obtained was beyond the tabled 
value for the one percent level of significance, 
so there was a significant difference between 
the means of the two groups even when the effects 
due to the influence of the initial Music Prefer- 
ence Test and the initial Music Recognition Test 
were removed. 

In this study, data were collected on fourteen
variables. These are:

1. Final Music Preference Test
2. Initial Music Preference Test
3. Initial Music Recognition Test
4. Final Music Recognition Test
5. Musical Accomplishment Test
6. Musical Information Test
7. Initial Oregon Music Discrimination Test
8. Final Oregon Music Discrimination Test
9. Pitch
10. Tonal Memory
11. Rhythm
12. Intelligence Quotient
13. Grade-Point Average
14. Socio-Economic Status


Ten separate analyses of variance and covariance
with two independent variables were carried
out. In each case, the initial Music Preference
Test and one other variable were held
constant in order to free the final Music Preference
Test scores of the effects of the inequalities
in these various variables, taken two at a
time. Table VII is a summary table including
all F ratios found and the conclusions drawn
from the ten analyses of variance and covariance.
In all ten analyses, the F ratios were beyond
the tabled values for F at the one percent level
of significance. This substantiates the conclusion
that the method of instruction in music
appreciation in which students listen both to music
and discussion of music is superior to the
method of instruction in which they listen to
music alone. Further, this conclusion is now
based on differences of the means of the final
Music Preference Test scores after these means
have been adjusted for any differences between
the two groups in each of the ten factors combined
with the initial Music Preference Test
score used in the analysis of variance and
covariance.


The Zero Control Group 





The Music Preference Test was administered
to this group of 24 students at the beginning of the
school year and again toward the end of the school
year in order to provide a basis for estimating
what changes in musical preference may be expected
from the environmental factors not controlled
in the experimental situation. The appropriate
statistical test in this situation is a
modified t test, because a high correlation is
present between the initial and the final scores.
The obtained t value of 3.46 was beyond the
tabled value for t at the one percent level of
significance, and the null hypothesis that there
is no significant difference between the scores
at the beginning of the year and at the end of the
year must be rejected.³

The mean of this group on the Music Preference
Test at the beginning of the year was 138.71;
at the end of the year it had dropped roughly
six points to 132.50. The latter mean remains
high and indicates a decided preference for popular
music in the group as a whole. The mean
of the experimental group of this study shifted
approximately sixteen points from a mean of
79. 3% during the year. However, the t test
indicates a significant change in the zero
control group, and this improvement in the musical
preference of the group may have been the
result of the activities of the students as members
of a high school choir. For this reason,
the design of this experiment may perhaps have
been improved by including an additional zero
control group which did not participate in any
formal musical activity.
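The source does not give the formula for its modified t test. A minimal sketch, assuming it is the usual t test for correlated (paired) scores computed from the individual differences, is shown below; the function name and the five pairs of scores are hypothetical.

```python
import math


def paired_t(initial, final):
    """t statistic and degrees of freedom for paired initial and final scores."""
    diffs = [a - b for a, b in zip(initial, final)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    # Unbiased variance of the differences.
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)
    return mean_d / math.sqrt(var_d / n), n - 1


# Hypothetical scores, for illustration only.
t_value, df = paired_t([10, 12, 9, 14, 11], [8, 11, 9, 12, 9])
print(t_value, df)
```

With 24 students the test has 23 degrees of freedom, so the obtained t would be compared with the tabled value of 2.807 quoted in the footnote for the one percent level.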


The Statistical Analysis of the Music Recog- 
nition Test Scores 








1. Test for pooling: As in the case of the
Music Preference Test scores, it was found 
permissible to group these scores into a single 
experimental group and a single control group. 

2. Test for normality: The two groups were 
subjected to a probit test of normality and found 
to be normal.

3. The analysis of variance and covariance: 
Table VIII tests the null hypothesis: there is no 
significant difference between the final Music 
Recognition Test scores of the experimental 
group and the control group when the initial 
test scores are held constant. The F ratio of 
6.09 lies within the region of doubt, for it is
larger than the tabled value for the 5 percent 
level, but smaller than the value for the one per- 
cent level. The null hypothesis may be rejected 
at the 5 percent level of significance but not at 
the 1 percent level of significance. 
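The three-way reading of an F ratio used here and in the summary tables can be written as a small rule. The sketch below is illustrative only: the function name is hypothetical, the critical values 3.96 and 6.96 are those printed beneath Table VIII, and the treatment of an F exactly equal to a critical value is an assumption.

```python
def classify_f(f_ratio, crit_05, crit_01):
    """Label an F ratio against the tabled 5 percent and 1 percent values."""
    if f_ratio < crit_05:
        return "Not significant"
    if f_ratio < crit_01:
        return "Region of doubt"
    return "Significant"


# F ratio and critical values as reported for Table VIII.
print(classify_f(6.09, 3.96, 6.96))
```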

Analyses of variance and covariance on the
Music Recognition Test scores with two independent
variables were carried out just as was
done in the preceding section with the Music
Preference Test scores. Table IX is the summary
table of all F ratios found and the resulting
conclusions.

3. The tabled value for t with N - 1 or 23 degrees of freedom is 2.807 at the 1 percent level.

TABLE VIII

ANALYSIS OF VARIANCE AND COVARIANCE OF MUSIC RECOGNITION TEST SCORES
HOLDING INITIAL MUSIC RECOGNITION TEST SCORES CONSTANT

Source of          Sums of Squares and Products    Adjusted or Reduced   Mean
Variance     df                                    Sum of Squares        Square     F*

Within       87    4444.92    2437.45    2590.71   1691.32                20.38
Between       1     369.88      56.15     144.11    124.11               124.11     6.09
Total        88    4814.80    2493.60    2734.82   1815.43

* 3.96 at 5% level; 6.96 at 1% level.


TABLE IX

SUMMARY OF F RATIOS AND CONCLUSIONS OF TEN ANALYSES OF VARIANCE AND COVARIANCE
OF FINAL MUSIC RECOGNITION TEST SCORES WITH TWO INDEPENDENT VARIABLES

     Dependent Variable   Independent Variable                        Conclusion

 1.  Final MRT            Initial Music Recognition Test
                          Initial Music Preference Test               Region of doubt

 2.  Final MRT            Initial Music Recognition Test
                          Musical Accomplishment Test                 Not significant

 3.  Final MRT            Initial Music Recognition Test
                          Musical Information Test                    Region of doubt

 4.  Final MRT            Initial Music Recognition Test
                          Initial Oregon Music Discrimination Test    Significant

 5.  Final MRT            Initial Music Recognition Test
                          Pitch                                       Region of doubt

 6.  Final MRT            Initial Music Recognition Test
                          Tonal Memory                                Significant

 7.  Final MRT            Initial Music Recognition Test
                          Rhythm                                      Not significant

 8.  Final MRT            Initial Music Recognition Test
                          Intelligence Quotient                       Region of doubt

 9.  Final MRT            Initial Music Recognition Test
                          Grade-Point Average                         Significant

10.  Final MRT            Initial Music Recognition Test
                          Socio-Economic Status                       Region of doubt

The F ratios of Table IX do not permit the
sweeping generalization which was possible from 
the F ratios of Table VII. In the analyses using 
the Music Recognition Test scores, five of the 
ten F ratios fall within the region of doubt and 
two others are clearly not significant. This find- 
ing, however, does not detract from the import- 
ant conclusion of the study. The basis of music 
appreciation is a relevant response to music, 
and this response involves value judgments. The 
knowledge of facts about music such as the name 
of the composition or the composer is secondary 
and relatively unimportant. Gernet remarks in 
his study that the ability to recognize music is 
not important in the appreciation of music. (2) 
The analyses of the Music Recognition Test 
scores of Table IX indicate some significant differences
in the experimental and control groups,
but the shift is not so pronounced as in the case 
of the Music Preference Test scores. 

Correlation coefficients between pairs of var- 
ious variables of the study were calculated. 
Table X contains some of the correlation coef- 
ficients found in this study. 


Summary and Conclusions 





This study was conducted in order to provide 
an experimental basis for judging which of two 
methods of teaching music appreciation is super- 
ior. One method, that of the experimental group, 
consisted of exposure to serious classical mus- 
ic together with explanatory comments and dis- 
cussion; the other method, that of the control 
group, consisted of exposure to serious classi- 
cal music without comment. In order to meas- 
ure the crucial variable, musical preference, a 
test which proved to be valid and reliable was 
constructed. A second test which measured a 
less important factor, music recognition, was 
also designed. Other factors measured were 
music accomplishment, musical training, 
musical discrimination, pitch, tonal memory, 
rhythm, I.Q., grade-point average, and socio-economic
status. The statistical tool utilized in
the analysis of the data was analysis of variance
and covariance. This analysis performed on the
final Music Preference Test scores with the in- 
itial Music Preference Test scores held constant 
revealed that there was a significant difference 
between the means of the experimental and con- 
trol groups when the final scores were adjusted 
or freed from the effects of the initial Music 
Preference Test scores. Ten additional analy- 
ses of variance and covariance were performed 
on the final Music Preference Test scores with 
both the initial Music Preference Test scores 
and each of the ten independent variables held 
constant. In every case, a significant difference 
was found between the means of the experiment- 
al and control groups on the final Music Prefer- 
ence Test scores after the necessary adjust- 
ments were made. 

The results of these analyses indicate the
superiority of the method used in teaching the
experimental group. The final conclusion of the
study, therefore, is that the method of instruction
in music appreciation which combines listening
to music with commentary and discussion
aimed at developing appreciation is superior to
the method of instruction in which music is
listened to without comment.
FUNCTIONAL COMPETENCE IN
MATHEMATICS 


G. DON ALKIRE 
Fresno State College 
Fresno, California 


IN RECENT years the mathematics pro- 
gram of the secondary school has been subject- 
ed to frequent criticism. Meanwhile, in our 
technological society, the need for mathemati- 
cal proficiency has been expanding. In an at- 
tempt to resolve some of the issues raised, a 
first step might well be the definition and anal- 
ysis of the present situation. Need is evident 
for carefully conducted and comprehensive re- 
search in that area. The present study was un- 
dertaken with the intention of contributing toward 
the satisfaction of that need. 


Statement of the Problem 





The problem was to determine some charac- 
teristics of the mathematics program in South 
Dakota’s secondary schools during the academic 
year of 1951-1952, and to investigate the relation
of functional competence in mathematics to
certain factors resident in the pupil, in the school,
and in the teacher. The study dealt with the composite
of all the mathematics courses taken in
the secondary school and was concerned with
functional competence in mathematics of students
in the fourth year of high school.


Delimitations of the Problem 





The primary purpose of the study was to se- 
cure statistically verified evidence concerning 
the following questions: 


1. Which of certain factors resident in the
pupil are significantly related to his functional
competence in mathematics? Specifically, when
means of functional-competence test scores are
adjusted for inequalities in mental scores (deviation
I.Q.’s) among the groups under consideration,
does the adjusted functional-competence
mean:

a. of either sex exceed that of the other sex?

b. of pupils who received their arithmetic
training in rural schools differ significantly
from that of pupils who received their
arithmetic training in urban schools?

c. of pupils who plan to attend college differ
significantly from that of pupils who have
not formed such plans?

d. of pupils who have had one or two years
of mathematics in high school differ significantly
from that of pupils who have had
more than two years of mathematics in
high school?

e. of pupils whose grade-point average in
mathematics courses places them in the
lower one-fourth of the distribution of
grade-point averages in mathematics differ
significantly from that of pupils whose
grade-point average in mathematics places
them in the upper one-fourth?

f. of pupils whose rank places them in the
lower one-fourth of the graduating class
differ significantly from that of pupils whose
rank places them in the upper one-fourth?


2. Which of certain factors resident in the
school are significantly related to the functional
competence in mathematics of its pupils? In
particular, when means of functional-competence
test scores are adjusted for inequalities in mental
scores among the groups under consideration,
are there any significant differences in adjusted
functional-competence means among:

a. schools enrolling less than 100 pupils,
those enrolling from 100 to 500 pupils,
and those enrolling 500 or more?

b. schools belonging to school districts the
assessed valuation of which places them
in the upper one-fourth of the distribution
of schools classified on the basis of assessed
valuation of school districts, and
schools belonging to school districts the
assessed valuation of which places them
in the lower one-fourth?





* Summary of The Relation of Certain Factors Resident in the Pupil, in the School, and in the Teacher, to Functional Competence in Mathematics, University of Kansas, 1953. Advisor: Kenneth E. Anderson, Dean, School of Education, University of Kansas.
3. When the mean functional-competence
scores are adjusted for inequalities in mental 
scores among the groups under comparison, are 
there any significant differences among the ad- 
justed functional-competence means of pupils 
attending schools, the average T-score* of 
whose teachers places the schools in the upper 
one-fourth of the distribution of schools ranked 
according to T-scores, and the adjusted func- 
tional-competence means of pupils attending 
schools, the average T-score of whose teachers 
places the schools in the lower one-fourth of the 
distribution? 


Selection of the Sample 





In order to draw relatively valid conclusions, 
it was imperative that the schools participating 
in the study (the sample) be representative of 
the high schools of South Dakota (the population). 
In order to increase the accuracy and represent- 
ativeness of the sample, the method of stratified 
random sampling was employed. Size of enroll- 
ment and type of administrative organization of 
school were selected arbitrarily as the most
feasible bases for stratification since these char- 
acteristics might reasonably be presumed to be 
systematically associated with the criterion var- 
iable (functional competence). The South Da- 
kota Educational Directory lists 298 secondary 
schools enrolling 6777 seniors in 1951-1952. 
Table I shows these schools classified into 
three groups on the basis of size of enrollment 
with each of these groups further subdivided in- 
to three subgroups on the basis of type of admin- 
istrative organization of school. The sample 
was to contain approximately ten percent of the 
secondary schools in South Dakota.

As the sample turned out, twenty four-year 
schools and one 6-6 type were chosen to repre- 
sent that class of secondary school enrolling less 
than 100; seven four-year schools and one 6-3-3 
type were chosen to represent that class of sec- 
ondary school enrolling between 100 and 500; 
and one 6-3-3 type was chosen to represent that 
class of secondary school enrolling 500 pupils 
or more. 
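The way a ten percent stratified sample distributes itself over the enrollment classes can be sketched as follows. This is a simplified illustration and not the procedure actually followed: it stratifies on enrollment only (the study also used type of organization), the stratum sizes 213, 79, and 6 are derived here from the percentages in Table I applied to 298 schools, and the school identifiers, function names, and random seed are hypothetical.

```python
import random


def proportional_allocation(stratum_sizes, fraction):
    """Number of units to draw from each stratum, rounded to whole units."""
    return {name: round(size * fraction) for name, size in stratum_sizes.items()}


def stratified_sample(strata, fraction, seed=0):
    """Draw a simple random sample of the allocated size within each stratum."""
    rng = random.Random(seed)
    sizes = {name: len(units) for name, units in strata.items()}
    allocation = proportional_allocation(sizes, fraction)
    return {name: rng.sample(units, allocation[name]) for name, units in strata.items()}


# Stratum sizes derived from Table I (an assumption); school labels are placeholders.
sizes = {"less than 100": 213, "100 to 500": 79, "500 or more": 6}
strata = {name: [f"{name} #{i}" for i in range(1, n + 1)] for name, n in sizes.items()}

allocation = proportional_allocation(sizes, 0.10)
sample = stratified_sample(strata, 0.10)
print(allocation)
```

Under these assumptions the allocation comes to 21, 8, and 1 schools, which agrees with the thirty schools described above.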


Data-Gathering Instruments


Data were obtained from the thirty schools 
in the sample as follows: 





1. Data concerning the sex, mathematical 
background, and educational plans of the pupils 
participating in the study were collected by means 
of a schedule filled out by each pupil. 
2. Data concerning the length of employment,
the experience in teaching mathematics, and 
the academic qualifications of the teachers were 
obtained from a schedule which was filled out 
by the principal or superintendent of the school. 

3. Data concerning the mathematics courses 
offered, the units of credit for each course, and
the rank of each pupil in the graduating class 
were obtained from a schedule which was filled 
out by the superintendent or principal of the 
school. 

4. Mathematics test scores in terms of stand- 
ard scores were obtained from the Davis Test 
of Functional Competence in Mathematics. 

5. Mental test scores in terms of deviation 
I. Q.’s were obtained from the Terman-McNemar 
Test of Mental Ability. 


The Examinations 


The mental test and mathematics test are 
well known standardized tests, properly vali- 
dated and shown to possess a high degree of re- 
liability. One can have considerable confidence 
that the examinations used were measuring the 
objectives for which they were designed, and 
were doing it consistently. 

The Davis test was constructed to measure 
functional competence in grades 9 through 12 
and consists of 80 items based on the essentials 
for functional competence in mathematics as 
outlined by the Commission on Post-War Plans
of the National Council of Teachers of Mathe- 
matics. The Commission defined functional 
competence in mathematics by a check list of 
29 items and stated that the school might turn 
out students that are functionally competent in 
mathematics if the program were built substan- 
tially on abilities and outcomes in certain areas. 
Some eminent mathematics educators have re- 
garded the Commission’s report as the most 
authoritative statement of the objectives of math- 
ematics instruction at the secondary level. 


Distribution of Scores 





The distribution of deviation I.Q.’s was
slightly skewed and slightly leptokurtic. The
Chi-square test showed that the distribution did 
not depart significantly from the normal distri- 
bution. The distribution of Davis standard scores 
displayed almost the same skewness and kur- 
tosis as did the distribution of mental scores, 
and the Chi-square test showed that it departed 
significantly from normality. Histograms and 
superimposed normal curves of best fit, to- 
gether with other information, indicated that 





* Each teacher was given a T-score in terms of years of teaching experience and a T-score in terms of number of semester hours of mathematics earned in higher institutions. The average T-score used in the study is the mean of those two T-scores.
TABLE I

DISTRIBUTION OF FIGURES PERTAINING TO ENROLLMENT AND
ADMINISTRATIVE ORGANIZATION OF THE 298 SECONDARY
SCHOOLS IN SOUTH DAKOTA DURING 1951-52

Classification of School         Number      Percentage
Enrollment       Organization    in State    of Total

Less than 100    6-3-3               2           .67
pupils                             203         68.12
                                     8          2.67
                                   213         71.46

From 100 to                          3          1.01
500 pupils                                     24.50
                                                1.01
                                               26.52

500 or more                                     1.34
pupils                                           .34
                                                 .34
                                                2.02

Grand Total                                   100.00
each distribution was uni-modal and exhibited
only a very slight departure from normality. 
Fisher (3) has shown that for curves that exhib- 
it only a moderate departure from normality, 
the efficiency of certain statistical techniques 
remains reasonably high. Also, in using the
analysis of variance and covariance, the assumptions
basic to this technique were tested.
For the t-test and F-test used in the analysis,
Cochran (2) has shown that a slight departure
from normality introduces no serious error
in the significance levels.


The Comparisons and Statistical Tools 
Employed 


In this study there were nine major compar- 
isons, each consisting, in most cases, of two 
or three minor comparisons. Using the technique
of analysis of variance and covariance,
comparisons were made on the basis of the mathematics
test scores, holding intelligence constant.
In the event this technique could not be
employed, the writer turned to the Behrens-Fisher
d-test. The analyses involve comparisons
of groups falling within extreme categories
only, thus reducing the possibility of overlap
of causal factors within the groups compared
and increasing the probability that these groups
truly represent different segments of the population.





Analysis of Variance and Covariance 





In one comparison the hypothesis to be tested 
was that there was no significant difference in 
mean functional competence in mathematics be- 
tween pupils who had planned to attend college 
and pupils who had not formulated such plans. 

In the first group (college-bound) there were 
417 pupils representing twenty-nine of the thirty 
schools in the sample; in the second group (non- 
college-bound) there were 353 pupils represent- 
ing the thirty schools. 

Before the pupil data (intelligence test scores 
and mathematics test scores) in each of these 
two groups could be pooled, two assumptions 
basic to pooling had to be fulfilled: 


1. that there was no difference between the
groups to be pooled in regard to standard deviations;

2. that there was no difference between the 
groups to be pooled in regard to means. 


The first assumption was tested by use of the
Welch-Nayer test on the ‘‘sum of squares within
groups.’’ The value for L₁ was obtained from
the formula

\[ \log L_1 = \log N - \frac{1}{N}\sum n_s \log n_s + \frac{1}{N}\sum n_s \log \theta_s - \log\Bigl(\sum \theta_s\Bigr) \]

and the corresponding probability was found in
Nayer’s tables. The second assumption was 
tested by use of the F-test, in which F was found 
by dividing the mean square between the groups 
by the mean square within the groups. Entering 
Snedecor’s table, the probability corresponding 
to a given value of F was obtained. The value 
of F as found by the analysis of variance tech- 
nique assumes equality of variances of the 
groups involved. This equality was tested by 
the previously mentioned Welch-Nayer test. 
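As an illustration of the L₁ computation, the sketch below evaluates the formula with common logarithms for the two pooled groups of Comparison III-A, using the group sizes 227 and 243 and the adjusted sums of squares 30958.9907 and 29230.9094 reported in the tables for that comparison. The function name is hypothetical, and reading the symbols as n_s for the size of a group, θ_s for its sum of squares, and N for the total number of cases is an assumption about the notation.

```python
import math


def welch_nayer_l1(group_sizes, sums_of_squares):
    """L1 statistic for homogeneity of variances, computed through its logarithm."""
    total_n = sum(group_sizes)
    log_l1 = (
        math.log10(total_n)
        - sum(n * math.log10(n) for n in group_sizes) / total_n
        + sum(n * math.log10(t) for n, t in zip(group_sizes, sums_of_squares)) / total_n
        - math.log10(sum(sums_of_squares))
    )
    return 10 ** log_l1


# Figures reported for the two pooled groups of Comparison III-A.
l1 = welch_nayer_l1([227, 243], [30958.9907, 29230.9094])
print(round(l1, 3))
```

The statistic equals 1 when every group has the same variance and falls toward 0 as the variances diverge, which is why a value above the tabled one leads to acceptance of the hypothesis of homogeneity.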

As shown in Table II, the twenty-nine sub- 
groups of pupils who were college-bound were 
homogeneous with regard to variances. It was 
found that the L₁ was greater than the value
found in Nayer’s tables at the 5 percent level. 
The null hypothesis, that there was no signifi- 
cant difference between the groups to be pooled 
in regard to standard deviations, was accepted. 

Not all of the subgroups allowed themselves 
to be pooled on the basis of equality of means. 
As shown in Table III in which eighteen of the 
twenty-nine subgroups were considered, an F 
of 0.49 was obtained. Entering Snedecor’s 
table, it was found that our F was less than the 
table value at the 5 percent level. The null hy- 
pothesis, that there was no significant differ- 
ence between the means of the subgroups to be 
pooled, was accepted. These eighteen groups 
also satisfied the L₁ test.

The eighteen subgroups who had planned to 
attend college, having satisfied the two criteria 
necessary to pooling, were pooled into one group 
(college-bound, Group III-A). 

In an exactly similar manner, twenty-nine 
of the thirty subgroups of pupils who had not 
planned to attend college, having satisfied the 
two criteria necessary to pooling, were pooled 
into one group (non-college-bound, Group III-A). 

We were now in a position to test the hypothesis
that there was no significant difference between
the two pooled groups with regard to means
on the mathematics test scores, holding intelli- 
gence test scores constant. The assumptions 
which had to be satisfied before the analysis of 
variance and covariance tool could be applied 
were: 


1. that there was no significant difference
between the standard deviations of the first
pooled group and the second pooled group;

2. that there was no significant difference
between the within partial regression coefficients
of the first pooled group and the second
pooled group.


Each of these assumptions was tested by the 
Welch-Nayer test. The first was tested by employing
the ‘‘sum of squares within groups,’’ using
the mathematics test scores. The second
was tested by using the adjusted ‘‘sum of squares 
TABLE V

TEST FOR HOMOGENEITY OF REGRESSION, COMPARISON III-A

θ_s'          log θ_s'    n_s log θ_s'    L₁    Hypothesis

30958.9907    4.49077                           Null
29230.9094    4.46584                           P > .05
60189.9001    4.77952     2104.60391            Accept
TABLE VI

DATA FOR ANALYSIS OF VARIANCE AND COVARIANCE, COMPARISON III-A

Groups               θ_x           θ_xy          θ_y

College Bound        30933.0749    19005.6167    42636.2555
Non-College Bound    38159.7613    22554.5350    42561.8354
Σ                    69092.8362    41560.1517    85198.0909
Total                83351.0234    55343.0042    98521.4553


TABLE VII

ADJUSTMENT TABLE FOR ANALYSIS OF COVARIANCE,
COMPARISON III-A

Groups               Correction    Adjusted θ_y'

College Bound        11677.2648    30958.9907
Non-College Bound    13330.9260    29230.9094
Σ                                  60189.9001
Within               24998.7495    60199.3414
Total                36746.3815    61775.0738
TABLE VIII

ANALYSIS OF VARIANCE AND COVARIANCE OF DAVIS SCORES
WITH MENTAL SCORE CONSTANT, COMPARISON III-A

Source     S.S.          M.S.         F          Hypothesis

Within     60199.3414     128.9065               Null
Between     1575.7324    1575.7324    12.2238    P < .01
Total      61775.0738                            Reject


TABLE IX

ADJUSTED DAVIS MEANS, COMPARISON III-A

                                              Means          Diff. of X             Adjusted
Groups               N     ΣX      ΣY      X        Y        from G.M.     Corr.    Means

College Bound        227   24422   26773   107.59   117.94   -5.70         .602     114.51
Non-College Bound    243   23465   26071    96.56   107.29    5.33                  110.50
Total                470   47887           101.89
within groups.’’ The adjustment took into account
the effect of differences in intelligence on
the mathematics test scores. 

As shown in Table IV, college-bound Group
III-A, comprised of 227 pupils, and non-college-bound
Group III-A, comprised of 243 pupils,
were tested for homogeneity of variances. The
L₁ of .999 was greater than the table value at
the 5 percent level. As shown in Table V, when
these two pooled groups were tested for homogeneity
of regression coefficients, there resulted
an L₁ of .998, which was greater than the table
value at the 5 percent level. The data necessary
to the analysis and adjustment are found in
Tables VI and VII. It was concluded that the two
pooled groups satisfied the assumptions basic
to the application of the technique of analysis of
variance and covariance.

We are now in a position to analyze results
and determine the F ratio for the two pooled 
groups. As shown in Table VIII, an F of 12.22 
was obtained which is greater than the table 
value at the 1 percent level. The null hypoth- 
esis was rejected, and it was concluded that 
there was a significant difference between the 
two pooled groups with regard to means on the 
mathematics test scores, holding intelligence 
test scores constant. 

Which pooled group was significantly more 
functionally competent in mathematics, holding 
intelligence constant? By applying a correction, 
an adjusted Davis mean was obtained for each 
pooled group. As shown in Table IX, the Davis
mean for the college-bound group was adjusted
from 117.94 to 114.51 whereas the Davis mean
for the other group was adjusted from 107.29 to
110.50, a difference in adjusted means of 4.01.
Hence, it was concluded that on the average, 
pupils who had planned to attend college were 
significantly more functionally competent in 
mathematics than were pupils who had not form- 
ulated such plans. 
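The arithmetic of Comparison III-A can be retraced from the sums of squares and products in Table VI and the group totals in Table IX. The sketch below is illustrative: the variable names are hypothetical, x stands for the mental score and y for the Davis score, and the single-covariate adjustment (sum of y squared less the squared sum of products over the sum of x squared) is assumed to be the one used. Recomputing from the printed figures gives values that agree with the published ones only to within rounding.

```python
def adjusted_ss(sxx, sxy, syy):
    """Sum of squares of y after removing the linear effect of x."""
    return syy - sxy ** 2 / sxx


# Sums of squares and products from Table VI (x = mental score, y = Davis score).
within = {"sxx": 69092.8362, "sxy": 41560.1517, "syy": 85198.0909}
total = {"sxx": 83351.0234, "sxy": 55343.0042, "syy": 98521.4553}
n_total, n_groups = 470, 2

ss_within = adjusted_ss(**within)
ss_total = adjusted_ss(**total)
ss_between = ss_total - ss_within
df_within = n_total - n_groups - 1          # one further d.f. lost to the covariate
f_ratio = (ss_between / (n_groups - 1)) / (ss_within / df_within)

# Adjusted means: Y mean less b times the deviation of the X mean from the grand mean.
b_within = within["sxy"] / within["sxx"]
groups = {                                   # N, sum of X, sum of Y from Table IX
    "college-bound": (227, 24422, 26773),
    "non-college-bound": (243, 23465, 26071),
}
grand_mean_x = sum(g[1] for g in groups.values()) / n_total
adjusted_means = {
    name: sy / n - b_within * (sx / n - grand_mean_x)
    for name, (n, sx, sy) in groups.items()
}

print(round(f_ratio, 2), {k: round(v, 2) for k, v in adjusted_means.items()})
```

The within-groups regression coefficient obtained this way is about .60, which matches the .602 printed in Table IX.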


Subsequent Tests Employed 





In the event the assumption of homogeneity of
variances of subgroups was not met, the writer
employed the Behrens-Fisher d-test:

\[ d = \frac{\bar{Y}_1 - \bar{Y}_2}{\sqrt{\dfrac{\sum (Y_1 - \bar{Y}_1)^2}{N_1(N_1 - 1)} + \dfrac{\sum (Y_2 - \bar{Y}_2)^2}{N_2(N_2 - 1)}}} \]

where the type of quantities represented by the
letters is obvious. Sukhatme’s table of d was
employed to determine the significance levels.
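A minimal sketch of the d statistic follows. The function name and the two small sets of scores are hypothetical, and the significance of d would still have to be read from Sukhatme's table, which is not reproduced here.

```python
import math


def behrens_fisher_d(y1, y2):
    """Difference of means over the root of the summed variances of the means."""
    n1, n2 = len(y1), len(y2)
    m1, m2 = sum(y1) / n1, sum(y2) / n2
    ss1 = sum((v - m1) ** 2 for v in y1)
    ss2 = sum((v - m2) ** 2 for v in y2)
    # Each group contributes its own variance of the mean; no pooling is done.
    return (m1 - m2) / math.sqrt(ss1 / (n1 * (n1 - 1)) + ss2 / (n2 * (n2 - 1)))


# Hypothetical scores, for illustration only.
d = behrens_fisher_d([1, 2, 3], [2, 4, 6, 8])
print(round(d, 4))
```

Because the two variances enter separately, the statistic does not rely on the equality of variances that the analysis of variance assumes.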

When two groups selected from a number of
groups were being compared, it became necessary
to employ a t-test. Wishart’s adaptation of
the t-test and the common t-test described by
Kenney and Keeping (6), the one used in the
study, are equivalent when only one variable
is held constant. The following formula was
used to compute the standard error of a mean
difference:

\[ \sigma_{\bar{Y}_1 - \bar{Y}_2} = \sqrt{\frac{\sum (Y_1 - \bar{Y}_1)^2 + \sum (Y_2 - \bar{Y}_2)^2}{N_1 + N_2 - 2} \cdot \frac{N_1 + N_2}{N_1 N_2}} \]


As Anderson (1) points out, caution must be ex- 
ercised in interpretation due to the higher sig- 
nificance levels imposed when means of select- 
ed samples are being compared. 
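The standard error of a mean difference given above, and the t ratio built on it, can be sketched as follows. The function names and scores are hypothetical, and the sketch covers only the common two-sample case without a covariate held constant.

```python
import math


def pooled_standard_error(y1, y2):
    """Standard error of the difference between two means, pooling the variances."""
    n1, n2 = len(y1), len(y2)
    m1, m2 = sum(y1) / n1, sum(y2) / n2
    ss1 = sum((v - m1) ** 2 for v in y1)
    ss2 = sum((v - m2) ** 2 for v in y2)
    return math.sqrt((ss1 + ss2) / (n1 + n2 - 2) * (n1 + n2) / (n1 * n2))


def pooled_t(y1, y2):
    """t ratio: difference between the means divided by its standard error."""
    diff = sum(y1) / len(y1) - sum(y2) / len(y2)
    return diff / pooled_standard_error(y1, y2)


# Hypothetical scores, for illustration only.
se = pooled_standard_error([1, 2, 3], [2, 4, 6, 8])
t_ratio = pooled_t([1, 2, 3], [2, 4, 6, 8])
print(round(se, 4), round(t_ratio, 4))
```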


The Findings 


Not all of the minor comparisons were in 
complete agreement with the respective major 
comparisons. However, on the whole, it would 
seem reasonable to make the following generalizations
from sample to population. On the average
a pupil was significantly more functionally
competent in mathematics if:


1. The pupil were a boy.

2. The pupil had taken his elementary arithmetic
training in rural schools rather than in
urban schools.

3. The pupil had planned to attend college rather
than to terminate his academic training at the
end of the twelfth grade.

4. The pupil had taken more than two years of
mathematics in high school rather than two
or fewer years of mathematics.

5. The pupil’s grade-point average in mathematics
placed him in the upper one-fourth of
the distribution of pupils on the basis of
grade-point average in mathematics rather than in
the lower one-fourth.

6. The pupil’s academic record in high school
ranked him in the upper one-fourth of his
graduating class rather than in the lower
one-fourth.

7. The pupil were in a school enrolling more
than 500 pupils rather than in one enrolling
either less than 100 or between 100 and 500.

8. The pupil were in a school, the assessed
valuation of whose school district placed the
school in the upper one-fourth of the distribution
of schools on the basis of assessed
valuation of school district rather than in
the lower one-fourth.

9. The pupil were in a school, the average
T-score (a number which took into account
both the number of years of teaching experience
and the number of semester hours of
mathematics preparation in higher institutions)
of whose mathematics teachers placed
the school in the upper one-fourth of the
distribution of schools on the basis of average
T-score rather than in the lower one-fourth.


The following measures were calculated from
the examination scores of all the pupils in the
sample:


1. Mean I.Q., 103.28; Standard Deviation 13.
(Norm: 105; 15)
2. Mean Davis Score, 114.02; S.D. 15. (Norm:
116; 16)
3. Coefficient of correlation between intelligence
and functional competence in mathematics,
.61.
Recommendations





It is recommended that results of research 
be made available to those directly concerned 
with their utilization, to the end that boys and 
girls may be more adequately trained in mathe- 
matics, not only for the precise purpose of train- 
ing mathematicians, but with the broader objec- 
tive of opening new worlds of thought and en- 
deavor to the layman and citizen of tomorrow. 

It is further recommended that grants be made 
by governmental and private agencies to encour- 
age further research of an experimental nature 
necessary to establish mathematics instruction 
upon a scientific basis and assure its continuous 
improvement. 
AN APPLICATION OF THE FERGUSON
METHOD OF COMPUTING ITEM 
CONFORMITY AND PERSON 
CONFORMITY 


H. M. FOWLER 
Ontario College of Education 
Toronto, Canada 


IN THIS paper, item conformity is defined 
as the product-moment correlation between the 
actual answer pattern of the item and the ‘‘ideal’’
answer pattern, which is a function of the distribution
of the total scores on the test. Person
conformity, which is similarly defined, refers 
to the relationship between the individual’s re- 
sponses to the test items and the responses of 
the group as a whole. Item conformities and 
person conformities may be computed by means 
of what has been called the ‘‘Ferguson Method’’. 
Some theoretical aspects and practical applica- 
tions of the Ferguson Method are discussed in 
this paper. To illustrate possible uses of this 
procedure, a report is given of an empirical
study in which the method was used. The data 
used in the analysis were obtained by adminis- 
tering an experimental edition of a 74-item vo- 
cabulary test to a group of grade 8 pupils. To 
study the effect of eliminating non-conforming 
persons, item conformities are first computed 
for the total sample of 100 persons and then 
re-computed with the 18 least-conforming per- 
sons removed. Similarly, the effect of exclud- 
ing non-conforming items is studied by first com- 
puting person conformities for all items and 
then re-computing them with the 23 least con- 
forming items removed. The effect of removing 
both non-conforming persons and non-conform- 
ing items is also studied. Since the sample of 
persons used in the empirical study was small, 
and since the test items were designed to meas- 
ure only one type of achievement, the conclus- 
ions must be considered tentative and particular. 


1. Item Validity 





It is standard procedure in test construction
to compute the difficulty and the validity of the
items used in the trial runs. Difficulty is ordinarily
defined as the percentage of a specified
group of students who get the item correct (or
incorrect, if preferred). Validity has been variously
defined. As a consequence, there are a
number of ways of obtaining estimates of validity.
At the Department of Educational Research,1
during the construction of the first experimental
edition of an achievement test, we get an estimate
of curricular validity by comparing the
item content with the course of study as laid
down by the Province of Ontario. Statistical estimates
of item validity can be obtained later
either by comparing the item scores with the
scores on some criterion outside the test, or
with the total scores of the test. It is usually
necessary to use the total scores as the item
criterion since reliable extra-test criterion
measures are not often available. Some refer
to the correlation between the item and the total
test score as the ‘‘validity’’ of the item, but it
seems to be more appropriate to think of it as
the ‘‘conformity’’ of the item since it is a measure
of how well the item fits in with the total
group of items.2 Securing item conformity will
not necessarily lead to high test validity if test 
validity is defined in terms of correspondence 
with an outside criterion. Nevertheless, most 
test constructors, whether by choice or not, 
look for item conformity first because item con- 
formity appears to be a prerequisite of test 
reliability and without test reliability no test 
validity is possible. 

There are a number of statistical procedures 
for computing estimates of item conformity. 
Guttman has introduced a technique for scaling 
items.3 Loevinger has proposed still another 





1. Ontario College of Education, University of Toronto.


2. Item conformity is an indication of the discriminating power of the item, the ability of the
item to separate the sheep from the goats. A good item is one which is passed by most of the
"good" students, as determined by the total test, and failed by most of the "poor" students.


3. Louis Guttman. "The Cornell Technique for Scale and Intensity Analysis," Educational and
Psychological Measurement, VII (Summer 1947), pp. 247-279.
approach to test construction.4 In the Department
of Research, to get item conformity, we
compute the correlation between the test item 
and the total test score by means of what we 
call the ‘‘Ferguson method’’.5 Readers will 
recognize that the Ferguson method of item an- 
alysis is similar to Guttman’s scaling methods 
and to the procedures discussed by Loevinger. 
The Ferguson method, although not widely dis- 
cussed in the literature, preceded the other 
methods. 


2. Item Conformity: Ferguson Method





An estimate of the conformity of a test item 
that is scored as one for right and zero for wrong 
may be obtained by computing the product-mo- 
ment correlation between the actual pattern of 
response for the item and the pattern which 
ranks the persons in the same order as they are 
ranked by the total test score. Consider Figure 
1. 

The correlation between the scores of the ac- 
tual answer pattern and those of the ‘‘ideal’’ 
answer pattern is given by the following form- 
ula: 


\[ R_{ia} = \frac{PQ - NW}{PQ} \]


where P = the number of persons getting the
item right

Q = the number of persons getting the
item wrong

N = the total number of persons = P + Q

W = the number of ‘‘misplaced’’ persons,
which is given by either the number
of persons below the point of dichotomy6
who get the item right or the
number of persons above the point
of dichotomy who get the item wrong.


In this example, P = 13, Q = 5, N = 18, W = 2.


Therefore,

\[ R_{ia} = \frac{(13)(5) - (18)(2)}{(13)(5)} = \frac{29}{65} = .45 \]
That Ria is the product-moment correlation between
scores on the actual answer pattern and
scores on the best answer pattern may be demonstrated
as follows:


Let X denote a score on the actual answer 
pattern and Y denote a score on the ‘‘ideal’’ 
answer pattern 


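Then:

\[ \sum X = \sum X^2 = \sum Y = \sum Y^2 = P \quad \text{(summing for the } N \text{ individuals)} \]

and

\[ \sum XY = P - W \]

\[
\begin{aligned}
r_{XY} &= \frac{N\sum XY - \sum X \sum Y}{\sqrt{\left[N\sum X^2 - (\sum X)^2\right]\left[N\sum Y^2 - (\sum Y)^2\right]}} \\
&= \frac{N(P - W) - P^2}{\sqrt{(NP - P^2)(NP - P^2)}} \\
&= \frac{P(N - P) - NW}{P(N - P)} \\
&= \frac{PQ - NW}{PQ}, \quad \text{since } N - P = Q \\
&= R_{ia}, \text{ the conformity of the item.}
\end{aligned}
\]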
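
The identity just demonstrated can be checked numerically. The sketch below, in Python, assumes that the persons are already arranged in descending order of total score. The answer pattern is hypothetical, chosen only so that P = 13, Q = 5 and W = 2; it is not the actual pattern of Figure 1, and the function names are illustrative.

```python
import math


def ferguson_conformity(pattern):
    """Return (P, Q, W, R_ia) for one item.

    pattern: 0/1 item scores, persons ordered from highest to lowest total.
    """
    n = len(pattern)
    p = sum(pattern)   # persons getting the item right
    q = n - p          # persons getting the item wrong
    if p == 0 or q == 0:
        raise ValueError("conformity is undefined when all pass or all fail")
    # The point of dichotomy falls after the first P persons;
    # W counts the passes that lie below it.
    w = sum(pattern[p:])
    return p, q, w, (p * q - n * w) / (p * q)


def product_moment(x, y):
    """Product-moment correlation from raw sums."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))


# Hypothetical pattern: 2 failures above and 2 passes below the dichotomy.
actual = [1] * 11 + [0, 0] + [1, 1] + [0, 0, 0]
ideal = [1] * 13 + [0] * 5   # the P highest-scoring persons pass

p, q, w, r_ia = ferguson_conformity(actual)
print(p, q, w, round(r_ia, 2))
print(round(product_moment(actual, ideal), 2))
```

Both routes give 29/65, so the counting formula and the product-moment correlation agree for this pattern.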


The method just described may be used to 
analyse the items of a test. Figure 2 shows
part of an item analysis of a 33-item arithmetic 
test, obtained by analysing the responses given 
on 100 test booklets which were selected so as 
to be a representative sample of a larger sample 
of 1500 test booklets. Each item had five op- 
tions. 

In the item analysis given in Figure 2, indi- 
viduals appear in the columns, items in the 
rows. The initial work consists of tallying the 
responses made by each individual on each item 
of the test, where the individuals are arranged 
in descending order of the total score from left 
to right. If the response is correct, the square 
is left blank;7 otherwise, a number from one to





4. Jane Loevinger. "A Systematic Approach to the Construction and Evaluation of Tests of Ability,"
Psychological Monographs, LXI (1947), pp. 1-49.


________. "The Technic of Homogeneous Tests Compared with Some Aspects of Scale Analysis
and Factor Analysis," Psychological Bulletin, XLV (November 1948), pp. 507-529.


5. G. A. Ferguson. The Reliability of Mental Tests (London: University of London Press, 1942).


6. The point of dichotomy of an item is obtained by counting from right to left the number of
squares indicated by Q; for example, in Figure 1, the point of dichotomy is the point which
divides the 18 squares into two parts so that 13 squares are on the left and 5 squares are
on the right of that point.


7. A blank square indicates a correct item, which is scored "1"; in Figure 1, the scores of the
items were shown.
k (the number of alternatives), showing the
option marked in error, or the letter ‘‘O’’, 
showing an omitted item, is recorded. The 
computing shown at the right of Figure 2 is 
simple. The value of W is obtained by counting
the blank squares to the right of the point
of dichotomy, and this value can be checked by 
counting the number of error (and omission) 
squares to the left of the point of dichotomy. 
The measure of the conformity of the item, Ria, 
is obtained by substitution in the formula given 
above. 

When item responses are plotted as shown 
in Figure 2, the analysis provides very useful 
information. First, if desired, estimates of 
the seductiveness of the incorrect options of any 
item may be obtained by studying the distribu- 
tion of incorrect responses. Second, the diffi- 
culty of the item is given by P. For example, 
the difficulty of the first item shown in Figure 
2 is 76%, since 76 out of 100 pupils got the item
right. Third, an estimate of the conformity
of the item is given by Ria.
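
The tabulation just described can be sketched in Python. This is a minimal sketch, assuming the responses have already been scored 1 for right and 0 for wrong (errors and omissions alike), so the distribution of incorrect options is not reproduced; the small response matrix is hypothetical and the function name is illustrative.

```python
def item_analysis(responses):
    """Difficulty and conformity of every item.

    responses[i][j] is 1 if person i got item j right, otherwise 0.
    """
    # Arrange persons in descending order of total score
    # (persons with tied totals keep their input order).
    ordered = sorted(responses, key=sum, reverse=True)
    n_persons = len(ordered)
    results = []
    for j in range(len(ordered[0])):
        column = [row[j] for row in ordered]
        p = sum(column)
        q = n_persons - p
        w = sum(column[p:])  # passes below the point of dichotomy
        r_ia = (p * q - n_persons * w) / (p * q) if p and q else None
        results.append({
            "item": j + 1,
            "difficulty_pct": 100 * p / n_persons,
            "W": w,
            "R_ia": r_ia,
        })
    return results


# Hypothetical scored responses: 6 persons by 3 items.
scored = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
]
analysis = item_analysis(scored)
for line in analysis:
    print(line)
```

An item passed or failed by every person has no defined conformity, which the sketch reports as `None` rather than as a number.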

It is clear that the conformity coefficient is 
merely an index of the agreement between the 
actual answer pattern and the ideal answer pat- 
tern of the item. It shows how well the item 
agrees with the total test score, that is, how
well the item ‘‘fits in with’’ or ‘‘conforms with’’
the total of the items. This coefficient can be 
used during the early stages of the construction 
of a test to denote those items which constitute 
a homogeneous, and presumably a reliable, 
group of items. 

The items which are most satisfactory for 
use in a test are those which have the difficulty 
and the conformity which best meet the purpose 
of the test. No general rules will apply in all 
cases. It has been found, however, that for 
many purposes the most satisfactory item is 
one which has a difficulty level at or near 50% 
and a conformity coefficient of .20 or higher. 
These criteria levels are arbitrary and should 
not be slavishly followed. If one were able to 
select items solely according to their conform- 
ity, the selection might be made by ranking the 
items by conformity and choosing the most conforming
items. In actual practice, however,
other things must be considered, such as the 
difficulty of the items and whether the available 
items conform enough for the purpose at hand. 


3. Person Conformity 


If, in an item analysis, the plotting is done 
so that the individuals appear in the rows, with 
items in the columns, it is possible to use the 
Ferguson method to compute what may be called 
‘‘person conformities’’. The data for the analysis
are the same as those which may be used
to analyse the items; only the arrangement of 
the data is different. The computing is similar
to that shown in Figure 2: P is the score of the 
individual, rather than the difficulty of the item, 
Q is the number of errors and omissions made 
by the individual, and W is the number of ‘‘misplaced’’
items, defining a misplaced item as an
easy item failed by a high-scoring individual or
a difficult item passed by a low-scoring individual.
The coefficient, Ria, is the correlation
between the actual answer pattern of the person
and his ‘‘ideal’’ answer pattern. It is an index
of the agreement of the individual’s pattern of
response with that of the group as a whole; it
is a measure of his conformity.
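
The same computing, transposed to persons, can be sketched in Python. This is a minimal sketch under two assumptions that the text does not spell out: that the items are arranged from easiest to hardest by the number of persons passing them, and that a person's ‘‘ideal’’ pattern consists of passing the P easiest items. The response matrix is hypothetical and the function name is illustrative.

```python
def person_conformities(responses):
    """Conformity of each person; responses[i][j] is 1 (right) or 0 (wrong)."""
    n_items = len(responses[0])
    # Order the items from easiest to hardest by number of persons passing.
    passes = [sum(row[j] for row in responses) for j in range(n_items)]
    order = sorted(range(n_items), key=lambda j: passes[j], reverse=True)
    conformities = []
    for row in responses:
        pattern = [row[j] for j in order]
        p = sum(pattern)     # the person's score
        q = n_items - p      # errors and omissions
        w = sum(pattern[p:]) # difficult items passed beyond the P easiest
        if p == 0 or q == 0:
            conformities.append(None)  # undefined for perfect or zero scores
        else:
            conformities.append((p * q - n_items * w) / (p * q))
    return conformities


# Hypothetical scored responses: 5 persons by 4 items.
scored = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
]
person_r = person_conformities(scored)
print(person_r)
```

The third and fifth persons each pass one item beyond their expected range and fail one within it, which brings their conformity down to zero.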

A non-conforming person is an idiosyncratic 
person in the sense that he does not respond to 
the item as he would be expected to respond 
considering his total score on the test. Persons 
with relatively high total scores who fail many 
of the easy items are non-conforming; persons 
with low total scores who pass many of the dif- 
ficult items are non-conforming. Person con- 
formity, then, is relative to a particular group 
of items. 

Person conformities, unlike item conform- 
ities, are not used to assist the test constructor 
to select items. They do, however, have defin- 
ite possibilities as aids in helping the test theo- 
retician to understand intricate item-person 
relationships. Also they would seem to have 
value for the teacher or guidance counsellor as 
aids in diagnosing individual difficulties. Why 
is a certain person non-conforming with respect 
to the responses he makes on a certain test? 
An examination of the item pattern may provide 
invaluable clues for the remedial treatment of 
the student’s difficulties in the area represented
by the test.


4. Empirical Results for Items and Persons 





One purpose of this paper is to examine the 
changes which occur in item conformities and 
in person conformities under various conditions 
of test sampling. The first step was to compute 
the conformities of the items of a 75-item a- 
chievement test in vocabulary at the grade 8 
level. The data for the analysis were obtained 
by selecting a representative sample of 100 test
booklets from a group of approximately 400 test 
booklets completed by Ontario students, tested 
in May, 1946, during the construction of a bat- 
tery of achievement tests. 

The conformity coefficients ranged from .56 
to -.15 with an average of .27. A number of
the items of this experimental edition of the 
test did not agree with the total score, in pupil
placement, sufficiently well to be included in 
revised editions of the test. Approximately 
fifty of the items had conformity coefficients of 
.20 or higher so that, on the basis of conformity 
alone, those items below .20 might be eliminated
if a 50-item test were desired.

The second step was to compute person con- 
formities; a person analysis was completed by 
re-arranging the vocabulary test data of the it- 
em analysis. The person conformities ranged 
from .67 to .13 with an average of .41. From 
this it would appear that person conformity es- 
timates tend to run higher than item conformity 
estimates. It was decided arbitrarily to class- 
ify as non-conforming the eighteen persons 
whose conformity estimates were below .30. At 
present the level can be chosen only in terms 
of its convenience because very little research 
on person conformity has been completed. An 
inspection of the table of person conformities 
showed that a relatively high proportion of the 
non-conforming persons were among those with 
the lowest test scores—ten out of eighteen per- 
sons with conformities below .30 appeared in 
the bottom quarter of the group, as determined 
by total test score, and fourteen appeared in 
the bottom half. Is it generally true that per- 
sons with low scores are less conforming than 
persons with high scores? This is a question 
worth investigating in future research.8


5. Sampling of Persons and Items for Computing
Item Conformities





It is a tenet of test theory that a reliable test 
is built by obtaining a group of homogeneous 
items. But if we are going to use conformity 
as a criterion in deciding whether to retain or 
eliminate an item, we should be sure that we 
are following a sound procedure in getting an 
estimate of conformity. Was the item-correl- 
ation low because the item incorrectly misplaced 
conforming persons (item conformity), or be- 
cause it quite properly misplaced non-conform- 
ing persons (person conformity)? Should the 
sample of persons providing the data for esti- 
mating the conformity of the item be chosen at 
random, after providing for representation of 
all segments of a certain population, or should 
it be made up only of conforming persons?

Since the item conformity coefficient is ob- 
tained by computing the correlation between it- 
em performance and total test performance of 
a specified group of students, the size of the 
correlation depends upon the distribution of the 
total scores, which in turn depends to some ex- 
tent upon the manner in which the students in 
the item analysis sample were selected. If only
students of a restricted range of ability are used 
to provide data for the analysis, the conformity
coefficients of all the items will likely be re- 
duced. Also, since the total score is obtained 
by summing the results of the individual items 
it might happen that the apparent non-conform- 
ity of an item is due to the non-conformity of 
the other items with which it is associated. If 
the items as a whole are not ranking the stud- 
ents in the proper order, the good item will not 
show a high correlation with the incorrect rank- 
ing. The correlation of conformity depends up- 
on the innate merit of the item itself, upon the 
value of the total items in aggregate, and upon 
the selection of the sample used in the item 
analysis. 

It is the policy of the Department to administer
the first experimental edition to a large
representative group of Ontario pupils, prefer- 
ably at least one thousand. When the tests have 
been scored the booklets are arranged in de- 
scending order of total test score and an item 
analysis sample is selected by taking every 
tenth, eleventh or twelfth booklet so as to get 
one hundred or two hundred booklets. We use 
two hundred booklets whenever we can but the 
labour involved in the item analysis sometimes 
makes it impractical to use as many as this, 
particularly if there is a large number of items 
in the test. 
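
The selection of the item analysis sample can be sketched in Python. This is a minimal sketch, assuming a fixed interval and a fixed starting booklet; the thousand total scores are generated artificially for illustration and are not Ontario data, and the function name is hypothetical.

```python
def systematic_sample(total_scores, step, start=0):
    """Indices of the booklets chosen for the item analysis sample.

    Booklets are arranged in descending order of total score and every
    `step`-th booklet is taken, beginning with position `start`.
    """
    ranked = sorted(range(len(total_scores)),
                    key=lambda i: total_scores[i], reverse=True)
    return ranked[start::step]


# Artificial total scores for 1000 booklets (illustration only).
total_scores = [(i * 37) % 75 for i in range(1000)]

sample_100 = systematic_sample(total_scores, step=10)
sample_200 = systematic_sample(total_scores, step=5)
sampled_scores = [total_scores[i] for i in sample_100]
print(len(sample_100), len(sample_200))
print(sampled_scores[:5], sampled_scores[-5:])
```

Because the booklets are taken at equal intervals down the ranked list, the sample spans the whole range of total scores rather than a restricted band of ability.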

Since the sample used in the item analysis is 
sometimes relatively small, the make-up of the 
sample is a matter of considerable theoretical 
importance. What students should be used? It 
is obvious that the sample should be represent- 
ative of the population for which the test was de- 
veloped: experimental editions of the test should 
be administered to large representative samples 
and the item sub-sample should be chosen in a
representative manner. Other questions con- 
cerning what is good practice remain. For ex- 
ample, should ‘‘non-conforming’’ students be 
eliminated from the item sample? 

In interpreting item conformity statistics, 
the question arises as to whether the non-con- 
formity of certain items is due to the presence 
of non-conforming persons in the item analysis 
sample or due to innate weaknesses in the items 
themselves. To examine this question, the it- 
em conformities were re-computed after the 
eighteen least-conforming persons were removed 
from the sample. After this was done, the it- 
em conformities of the most-conforming items 
were computed a third time after both non-con- 
forming persons and non-conforming items 
were eliminated. 
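
The first re-computation can be sketched in Python. This is a minimal sketch, assuming the critical level of .30 for persons that is used in this study, and assuming that persons whose conformity is undefined (perfect or zero scores) are retained. The response matrix is a hypothetical toy example and is not meant to reproduce the results reported below; all function names are illustrative.

```python
def conformity(pattern):
    """Ferguson conformity of a 0/1 pattern already in its ranked order."""
    n = len(pattern)
    p = sum(pattern)
    q = n - p
    if p == 0 or q == 0:
        return None  # undefined when all pass or all fail
    w = sum(pattern[p:])
    return (p * q - n * w) / (p * q)


def item_conformities(rows):
    # Persons in descending order of total score.
    ordered = sorted(rows, key=sum, reverse=True)
    return [conformity([r[j] for r in ordered]) for j in range(len(ordered[0]))]


def person_conformities(rows):
    n_items = len(rows[0])
    passes = [sum(r[j] for r in rows) for j in range(n_items)]
    # Items from easiest to hardest.
    order = sorted(range(n_items), key=lambda j: passes[j], reverse=True)
    return [conformity([r[j] for j in order]) for r in rows]


def drop_nonconforming_persons(rows, level=0.30):
    """Keep persons at or above the critical level (or with undefined conformity)."""
    estimates = person_conformities(rows)
    return [r for r, c in zip(rows, estimates) if c is None or c >= level]


# Hypothetical scored responses: 8 persons by 5 items.
rows = [
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1],  # passes only the two hardest items
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1],  # passes only a hard item
    [0, 0, 0, 0, 0],
]
before = item_conformities(rows)
kept = drop_nonconforming_persons(rows)
after = item_conformities(kept)
print(before)
print(len(kept), after)
```

In so small a sample an item may lose all of its passes once persons are removed, in which case its conformity can no longer be estimated at all.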





8. The correlation between total test score and person conformity estimate was .34, which is
significantly different from zero at the 1% level of significance. The total test scores
of the 100 persons in the analysis ranged from 66 to 19.
6. Comparison of Item Conformities





The data consisted of different estimates of 
the conformities of the 74 items (one item was 
discarded because of scoring difficulties) according
to the samples of persons and items
used in computing these estimates. The follow- 
ing data were available: (1) the item conform- 
ities which were obtained when 100 persons and 
74 items were used in the item analysis; (2) the 
conformities of the same 74 items when conformities
were computed from data provided by the
most-conforming persons of the original item
analysis sample: the 18 least-conforming persons
(as judged by the person analysis) were
removed from the sample and the item conformities
were then re-computed; (3) the item conformities
of the 51 most-conforming items (as
judged by the original item analysis) when these
conformities were estimated from data provided
by the most-conforming persons.

When the item conformities obtained under 
the three conditions of sampling were compared, 
it was noticeable that item conformities on the 
whole change very little when non-conforming
persons, non-conforming items, or both, are 
removed from the item analysis sample. Of the 
74 items, 42 were conforming—had an r of .20 
or higher—regardless of the make-up of the 
sample. Furthermore, many of the changes 
that do occur are small, no greater than might 
be expected from the reduction in the size of the 
sample.9

The removal of the non-conforming persons 
resulted in a general reduction of the conform- 
ity estimates and an increase in the number of 
non-conforming items. The average of the correlation
coefficients was reduced from .28 to
.25; the number of non-conforming items was 
increased from 23 to 28. Only thirteen of the 
74 items changed their conformity status: 9 it- 
ems which were conforming became non-con- 
forming and four, which were non-conforming 
became conforming. The nine items whose conformity
dropped below .20 had original conformities
of .31 or lower. If .32, rather than
.20, had been used as the critical level of con- 
formity, these items would have been non-con- 
forming in the original item analysis. Thus it 
would appear that non-conforming items can be 
eliminated either by setting a fairly low critical
level of conformity for a sample of conforming 
persons or by setting a higher critical level of 
conformity for an unselected sample of persons. 
Rather small, and probably not statistically sig- 
nificant, increases in conformity occurred for 
four of the items.

Most of the larger changes in the conformity 
estimates appeared to be due to the removal of 
non-conforming items. The item conformities 
obtained for the 82 most-conforming persons 
and the total of 74 items agreed very closely 
with those obtained for the 82 most-conforming 
persons and the 51 most-conforming items. We 
may tentatively conclude that when the number 
of persons is held constant, a moderate elimination
of non-conforming items will not greatly
affect the conformity of the remaining items. 

The practical implications of the above ten- 
tative conclusions are: (1) Obtaining an item 
analysis sample by selecting a representative 
sample from a large group of test booklets ap- 
pears to be satisfactory; there is no need to 
eliminate non-conforming persons at the outset, 
but, as a safeguard, if the number of trial items
warrants it, the selection level could be raised
from .20 to .30. (2) The presence of very poor
(non-conforming) items in a test will not greatly
affect the conformity of the other items; in other
words, the item analysis done on the first experimental
‘‘run’’ of a test will provide useful
conformity estimates of the better items, esti- 
mates which should not be very different from 
those which might be obtained from later runs. 


7. Comparison of Person Conformities 





The available data, consisting of different 
estimates of the person conformities according 
to the samples of items and persons used, were 
as follows: (1) conformity estimates for 100 
persons when 74 items were used; (2) conform- 
ity estimates for 100 persons based on the 51 
most-conforming items; (3) conformity esti- 
mates for the 82 most-conforming persons com- 
puted from the 51 most-conforming items. 

The average person conformities for these 
three conditions of sampling were respectively 
.41, .34, and .38. This suggests that the effect
of eliminating non-conforming items is to reduce
the size of the person conformities, whereas
the effect of eliminating the non-conforming
persons is to increase slightly the conformities 
of the remaining (conforming) persons. The per- 
centages of non-conforming persons for the 
three sample situations were 18%, 36%, and 
23%. Thus it would appear that the percentage 
of non-conforming persons is increased when 
the non-conforming items are removed, but it 
is decreased when the non-conforming persons 
are eliminated. A comparison of the person
conformities with the item conformities suggests
that person conformities tend to run higher
than item conformities.





9. Reduction of the item analysis sample from 100 to 82, which makes the sample more select,
is undoubtedly a factor in bringing about some reduction in the size of the conformity
estimates, which are correlation coefficients.





244 JOURNAL OF EXPERIMENTAL EDUCATION

The most striking feature of the effect on 
person conformity of removing non-conforming 
persons is that the conformities for persons 
show a considerable amount of stability in the
face of changes in the analysis samples: those
that were definitely conforming remain conform- 
ing, and those that were definitely non-conform- 
ing remain non-conforming. Of the 100 persons 
in the original sample, 58 show conformity esti- 
mates above .30 regardless of the type of per- 
sons or items used in the analysis. 

It is clear that the effect, in general, of re- 
moving non-conforming items from the analysis 
is to raise the standard of person conformity.
In other words, on a selected group of items 
fewer people will be conforming according to a 
pre-designated conformity level than on
an unselected group of items. As is often the
case, the correlation between two variables (in
this case the person answer pattern and the ideal
answer pattern) is decreased when the variability
of one of the variables is decreased.

When non-conforming persons were removed
from the analysis sample, only 7 out of 82 persons
changed their conformity category. Thus
we may say that when the number of items is
held constant, an elimination of non-conforming
persons will not greatly affect the conformities
of the remaining persons. This agrees with a
corresponding tentative conclusion for items. It
appeared, however, that items (persons held
constant) were somewhat more stable than persons
(items held constant).


8. Reasons for Item Non-Conformity and Person
Non-Conformity








Because of the interaction between persons 
and items it is difficult to decide whether non- 
conforming persons produce non-conforming it- 
ems or whether non-conforming items make 
persons non-conforming. Is non-conformity 
due to the interaction of items and persons or 
is it due to outside factors? Are items non- 
conforming because of weaknesses in their con- 
struction or because of peculiarities in the re- 
sponses of some of the persons in the item anal- 
ysis sample? A more detailed examination of 
some of the non-conforming items and non-con- 
forming persons may help to clarify these points. 

A list of the 23 least-conforming items to- 
gether with data relating to them was prepared. 
For each of the 23 items the conformity coefficient,
the number of misplaced persons, and the
number of responses to each option were tabulated.
A study of this table suggested some hypotheses
concerning the reasons for item non-conformity.
In some cases one or more of the
incorrect options was weak; in other cases the 
options had double meanings or were not all 
grammatically parallel. One item appeared to be 
measuring both spelling and vocabulary, which 
may account for its non-conformity. Some it- 
ems were too easy—where the item is correctly 
or incorrectly answered by most of the pupils, 
the reliability of the conformity coefficient is 
low.10 Since most of the non-conforming items
appeared to be structurally inferior, their non- 
conformity would seem to be largely the concern 
of the test constructor rather than of the person 
who is interested in the idiosyncrasies of individuals.

Let us now examine the reasons for person 
non-conformity. Are persons non-conforming 
because of the presence of non-conforming items 
in the test or for some other reason? The ans- 
wer to this question will interest all teachers 
who are diagnosing weaknesses of pupils and 
applying remedial techniques. A table was pre- 
pared which showed the conformity estimates of 
the 18 least-conforming persons based on 74 items,
the estimates based on 51 items, and other
information concerning the total score on the vocabulary
test, the total number of items misplaced,
the number of easy items misplaced,
and the number of difficult items misplaced. 

From this table, it was noted that person 
non-conformity appeared not to be due to the 
presence of non-conforming items. One per- 
son, for example, misplaced 15 easy items and 
15 difficult items, but of these only 13% of the 
easy items and 33% of the difficult items were 
non-conforming items. His conformity, which 
was .13 for the test as a whole, dropped to .09
when the non-conforming items were removed 
from the test. In general, the persons who were 
non-conforming for the test as a whole were just 
as non-conforming for the purified test made up 
of conforming items only. We may conclude 
that for the most part person non-conformity ap- 
pears not to be due to item weaknesses. There 
are, however, some persons who may be ad- 
versely affected by non-conforming items. 

It was also apparent that to the extent that 
person non-conformity may be due to item weak- 
nesses, the more difficult items are more 





10. This arises because the conformity coefficient is not independent of the difficulty of
the item. When an item is very easy or very difficult, changes in the obtained conformity
coefficient as great as .20 or .30 can be expected "by chance". This has been verified
by unpublished research completed by the author.
blameworthy than the easy items. For example,
of fifteen easy items (difficulties ranging from
95% to 64%) ‘‘misplaced’’ by one person, two
were non-conforming items; of fifteen difficult
items (difficulties ranging from 64% to 14%) ‘‘misplaced’’,
five were non-conforming items. In
other words, of the non-conforming items, those 
that seem most likely to add to person non-con- 
formity are those that are difficult for the group 
as a whole—apparently because of weaknesses 
inherent in the items, both good and poor stud- 
ents choose the incorrect options. 


We may summarize by saying that one cause
of person non-conformity may be item weakness
or item non-conformity, but this may not be the 
only or even the most important cause. What, 
then, are the other factors which cause a student 
to get easy items wrong or difficult items right? 
What causes students to make the mistakes they 
do make? To what may we attribute the ‘‘individuality’’
of a person? In the case of an objective-type
test, part of the non-conformity may
be nothing more than a function of his guessing 
the correct answers. On the other hand, there 
may be a ‘‘real’’ characteristic of the person 
which might be revealed by a case study analy- 
sis. The problem is worth further investigation. 








TABLES FOR TRANSMUTATION OF ORDERS 
OF MERIT INTO UNITS OF AMOUNT 
OR SCORES 


KENNETH E. ANDERSON, ROBERT T. GRAY 
EINAR V. KULLSTEDT 
School of Education, University of Kansas 


THE FOLLOWING tables are adapted 
from a table presented by C. L. Hull* in 1922 
for the purpose of changing orders of merit, 
or ‘‘ranks,’’ into normalized scores. In its or-
iginal form this table contained corresponding
values of ‘‘percent position’’ and normalized
scores. The ‘‘percent position’’ was defined as 


$$\frac{100\,(R - .5)}{N}$$


where R is the rank of the individual in the ser-
ies and N is the number of individuals ranked.
By means of this table, then, it was possible to
provide a set of normalized scores on a given
characteristic for a group of individuals by first 
ranking them on the characteristic, then trans- 
forming the ranks into percent positions by the 
formula, and finally obtaining from the table the 
corresponding normally distributed scores. 

In order to obviate computing the percent po-
sitions of the individuals of a group when it is de-
sired to find their normalized scores, the fol-
lowing tables were developed. They contain the
normalized scores corresponding to every rank 
in groups of all sizes from 1 to 100 individuals. 
In order to find the normalized score for a given 
individual it is necessary only to find the table 
column corresponding to the number of individ- 
uals in the group and the table row correspond- 
ing to the rank of the individual in the group. The 
score will lie at their intersection. For example,
suppose an individual ranks 8th in a group of 35
persons with respect to a given characteristic.
Locating the table column corresponding to ‘‘size
of class’’ equal to 35 and the table row corres-
ponding to ‘‘rank in class’’ equal to 8, we find a
value of 66 at their intersection. This value is
the score, out of a possible 100, which would 
theoretically be made by the 8th ranked individ- 
ual in a group of 35, if the scores were norm- 
ally distributed. 





Tables I - VI as adapted from Hull are based
on a range of ranked ability arbitrarily cut off
at plus and minus 2.5 standard deviations.
The baseline of his curve is 5 standard devia-
tions and each of the 100 parts is equal to 0.05
standard deviations. Thus a rank of 2 in 50
gives a percent position of:

$$P = \frac{100(R - .5)}{N} = \frac{100(2 - .5)}{50} = 3.00$$

Translated according to his table we obtain a
score of 86.

Table VII gives the normal equivalents of
ranks in groups of all sizes from 1 to 25, where:

$$T = 50 + \frac{10\,(X - M)}{S.D.}$$

Thus, a rank of 1 in 25 has a percent position
of:

$$P = \frac{100(R - .5)}{N} = 2.00$$

Referring to the unit normal curve, we obtain
an $x/\sigma$ of 2.05. Thus the normalized equivalent
of a rank of 1 in 25 is:

$$T = 50 + 10\,(2.05) = 70.5 \text{ or } 71.$$


If one wishes to extend Table VII for groups
higher than 25, calculate the percent position as
before, look up the $x/\sigma$ value in a unit normal
table, and calculate the normalized equivalent.
For example, in a group of 31 individuals, we
have:

        Percent
Rank    Position      x/σ     T Score
 1      98.387097     2.14    71.4
 2      95.161291     1.66    66.6





*C. L. Hull, "The Computation of Pearson's r from Ranked Data," Journal of Applied Psychology, VI (1922),
pp. 385-390.
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TABLE 0 


ORDERS OF MERIT INTO UNITS OF AMOUNT 
(ACCORDING TO HULL) 
 3      91.935485
 4      88.709679
 5      85.483873
A constant may be subtracted each time to ob- 
tain the next percent position as follows: 





100/31 = 3.225806
98.387097 - 3.225806 = 95.161291
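
The extension procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' own computation: the function names `percent_position` and `t_score` are hypothetical, and the unit normal table is replaced by the standard library's `statistics.NormalDist`. Note that the percent positions listed for the group of 31 are counted from the bottom of the group, i.e. they equal 100 minus the value given by the formula 100(R - .5)/N.

```python
from statistics import NormalDist


def percent_position(rank: int, n: int) -> float:
    """Percent position 100(R - .5)/N, counted from the top rank."""
    return 100.0 * (rank - 0.5) / n


def t_score(rank: int, n: int) -> float:
    """Normalized T score, 50 + 10 * (x/sigma), for a rank in a group of n."""
    # Proportion of the unit normal curve lying below this rank's position.
    below = 1.0 - percent_position(rank, n) / 100.0
    z = NormalDist().inv_cdf(below)  # x/sigma, in place of a table look-up
    return 50.0 + 10.0 * z


if __name__ == "__main__":
    # Reproduce the first rows of the worked example for a group of 31.
    for rank in (1, 2):
        below = 100.0 - percent_position(rank, 31)
        print(rank, round(below, 6), round(t_score(rank, 31), 1))
```

Because the inverse normal function is not rounded to two decimals as a printed table is, results may differ from hand-computed values in the last digit.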


The use of scores translated into units of
amount according to Hull, for purposes of correl-
ation, will produce slightly higher correlations
than when T scores according to the unit normal
curve are used.











AN EMPIRICAL INVESTIGATION OF THE PROB-
LEM OF DISPROPORTIONATE FREQUENCIES
IN ANALYSIS OF COVARIANCE AS APPLIED
TO A METHODS EXPERIMENT


DAISY STARKEY EDWARDS and SIDNEY J. PARKIN 
University of London 


THE EXPERIMENT to be described en- 
countered difficulties familiar to many who un- 
dertake research in the classroom. They arise 
from the fact that it is seldom possible, in prac- 
tice, to attain an experimental structure exactly 
conforming to the requirements of statistical 
theory. In England, the yearly entry of 100 or 
more into a secondary school is often divided in- 
to ‘‘streams’’ of different intellectual ability, 
and there, for instance, it is not likely that a 
‘‘random sample’’ for experimental purposes 
will be found ‘‘in situ’’ in the classes of any
given age. Rearrangement in randomised groups
as envisaged by Lindquist (1) could so disrupt 
school organisation that the co-operation of those 
in authority would not be readily forthcoming 
even though they were favorably disposed towards 
experimental work. 

If the differences are almost certain to be due 
to unequal average mental capacity of the indi- 
viduals in the various classes, or some other 
measurable quantity, they may be overcome by 
using the method of analysis of covariance in 
cases where the normal methods would be the 
analysis of variance. A second difficulty, not 
disposed of so easily, is that many designs re- 
quire rather rigid restrictions on the number of 
cases in the experimental units. If, again, the
convenience of the schools is respected and the
classes are not altered in numbers, then, since
existing methods based on exact theory lead to
prohibitively laborious calculations, evidence
on the nature of the error likely to be committed
in using numbers only slightly departing from
the ideal theoretical condition would be
valuable. Both matters are considered
in a statistical handling of the data of the exper- 
iment. 

It was desired to compare the relative effects 
of each of three methods of presenting diagrams, 
in otherwise similar lessons, to children of first
year secondary school age. In order that prev- 
ious training should not affect the results, the 
lesson chosen was on a topic not likely to have 
been met previously by the children, and involved 
illustration. 

The topic chosen was ‘‘The Construction of a
Hand Sewn Slipper’’ and all groups were given 





the same lesson by Mr. Parkin. The lesson 
could be conveniently illustrated by means of 
three line diagrams. The first showed the shape 
and names of the parts which are sewn together 
to make the upper of a simple slipper, the sec- 
ond was an ‘‘exploded’’ line diagram showing 
the upper on a last and the various components 
which have to be attached to complete the slip- 
per, and the third was a perspective line sketch 
of the finished slipper with the toe cut away to 
show in section the relative positions and mode 
of attachment of the various components. The 
lesson, which took about fifteen minutes, follow- 
ed the order of practical procedure of slipper 
construction and a short recapitulation was made 
when explaining the third diagram. 

So far as was experimentally possible the only 
difference between the lessons lay in the method 
of presenting the diagrams. In one case they 
were built up freehand one at a time on the black- 
board, in the second case they were presented 
in the form of projected images from three film- 
slides, and in the third case they were present- 
ed in the form of three prepared wall charts. 
Both wall chart and filmslide diagrams consisted 
of white lines on a black background so as to con-
form with the blackboard diagrams, and the size
of the illustration presented in all cases was kept 
as uniform as possible. The experimenter 
took care to build up the board diagram in the 
manner in which he would normally illustrate a 
lesson. He considers that he is competent, but 
not exceptionally gifted, in blackboard work. 

In England, allocation of children into various 
types of secondary school takes place at the age 
of eleven, approximately, mainly on the basis of 
a selection examination which in the district con- 
cerned consists substantially of tests of intelli- 
gence and attainment in Arithmetic and English. 
Roughly, their order in the examination places
them into Grammar, Technical or Modern
Schools, though parents occasionally suggest
that they wish a child to go into a Technical
School instead of a Grammar School even though
his performance entitles him to a place in the
latter. Since the education offered in these 
types of school is different, the results of the 
experiment might be affected by this fact, and 





258 JOURNAL OF EXPERIMENTAL EDUCATION 


all types of school had to be included in the in- 
vestigation. There was also the possibility that 
boys and girls might have different results, so 
that it was necessary to include girls’, boys’ 
and co-educational schools, and also it was ar- 
ranged that the main comparison, between meth- 
ods, should in each of the three cases, have ap- 
proximately equal numbers of boys and girls. 
To allow for inequalities in intelligence of the 
children in the various classes, a preliminary 
short intelligence test consisting of items on 
order, correspondence and analogies was ad-
ministered. This test is not available in pub-
lished form, but was used in a County selection
examination elsewhere, and had been carefully
standardized. Without entering into any discus-
sion on the factorial contents of the test, it was
eventually found that its results had a significant
correlation with those of a short test given im-
mediately after the lesson on the facts taught in
the lesson, used as the final measurement in the
experiment, and it was concluded that its effect
should be eliminated from the methods of com-
parison. The methods were, within schools,
and with one small exception to be described
later, allocated at random to the classes.

One requirement of the two-way comparison, 
with schools and methods as the main effects, 
could not be met exactly. This is the numerical 
restriction which requires proportionate frequen- 
cies in each combination of method and school. 
The numbers actually made available are given 
in Table I. 

The elimination of the intelligence variate 
was to be attained by the use of analysis of co- 
variance. Analysis of covariance is a refine- 
ment of analysis of variance, designed to cover 
certain cases in which the simpler technique 
could not secure sufficient control of error. It 
is usually used in cases comparable to the pres- 
ent one, where intrinsic variations in the exper- 
imental material give rise to heterogeneity in 
the results which is irrelevant to the investiga- 
tion, and neglect to take account of these may 
lead to negative or even wrong conclusions. Our 
preliminary measurement of intelligence of the 
individuals in the experimental groups, estimat- 
ed by the experimenters as an attribute of the 
individuals which will almost certainly affect 
the results, is used to make an adjustment to the 
final readings, provided that it is verified dur-
ing the analysis that the assumption of its effect
is correct.

We give a brief account of the rationale of
the method of analysis of covariance which, it is
hoped, will clarify the method of treatment of the
data obtained in this experiment. It may be said,
in essence, to be a way of allowing for the lin-
ear regression of the final variate, $y$, on the one
or more initial variates $x_1, x_2, \ldots, x_p$, by carry-
ing out an analysis of variance of the deviation
of $y$ from its regression estimate. Consider
the case $p = 1$. Let $y_{ts}$ be the $t$-th observation
in the $s$-th group, and $y'_{ts}$ be its deviation from
the sample mean, and similarly for the $x$'s. Suppose
we are considering the simplest design of exper-
iment, a ‘‘between-within’’ comparison between
groups. Taking a null hypothesis of no differ-
ence between groups, and using the least squares
approach with corresponding population param-
eters, it is easy to prove that an F-test is ap-
propriate.

The form of the F ratio indicates the follow-
ing computational procedure. First, split the
sums of squares, $\Sigma x'^2$, $\Sigma y'^2$, $\Sigma x'y'$, into ‘‘be-
tween groups’’ and ‘‘within groups’’ components;
the three components of the latter provide the
error term, the denominator of F. The mathe-
matical procedure having shown that this is
$\Sigma (y' - bx')^2$, within groups, divided by an appro-
priate number of degrees of freedom, we com-
pute the value of this square sum as

$$\Sigma y'^2 - \frac{(\Sigma x'y')^2}{\Sigma x'^2}.$$


The numerator of F is found to be based on the
difference of a quantity similar to the above com-
puted from sums and products for totals and the
‘‘within groups’’ square sum. Unlike analysis
of variance, there is no alternate procedure for
the latter using ‘‘between groups’’ squares and
products, as this would result in a quantity
mathematically less than or equal to the above
difference.
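
The computational procedure for the simplest ‘‘between-within’’ case can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' worksheet: the function names and the small data set are hypothetical, and one initial variate x is assumed.

```python
def sums_of_squares(xs, ys):
    """Return (Sxx, Syy, Sxy): squares and products of deviations from the means."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxx, syy, sxy


def adjusted_ss(sxx, syy, sxy):
    """Sum of squares of y about its regression on x: Syy - Sxy^2 / Sxx."""
    return syy - sxy ** 2 / sxx


def one_way_ancova(groups):
    """groups: list of (xs, ys) pairs, one pair of lists per group."""
    all_x = [x for xs, _ in groups for x in xs]
    all_y = [y for _, ys in groups for y in ys]
    total = sums_of_squares(all_x, all_y)
    # "Within groups" components: add the per-group sums column by column.
    per_group = [sums_of_squares(xs, ys) for xs, ys in groups]
    within = tuple(sum(col) for col in zip(*per_group))
    ss_error = adjusted_ss(*within)
    # Numerator: adjusted total minus adjusted "within groups" square sum.
    ss_groups = adjusted_ss(*total) - ss_error
    df_groups = len(groups) - 1
    df_error = len(all_x) - len(groups) - 1  # one d.f. is lost through using b
    f_ratio = (ss_groups / df_groups) / (ss_error / df_error)
    return {"ss_groups": ss_groups, "ss_error": ss_error,
            "df_groups": df_groups, "df_error": df_error, "F": f_ratio}


if __name__ == "__main__":
    # Hypothetical scores: x = initial test, y = final test, two groups.
    data = [([1, 2, 3], [2, 4, 7]), ([1, 2, 3], [12, 14, 17])]
    print(one_way_ancova(data))
```

The numerator is obtained by subtraction, as in the text, and not from the ‘‘between groups’’ squares and products directly.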

If the mathematical procedure were carried 
out in more complicated cases, it would be found 
that the following rule of thumb procedure is the 
appropriate generalisation of the above findings, 
for one independent variate x only, in any design 
where the group frequencies are proportionate. 


1. Use at all stages of the analysis that
   $\Sigma (y' - bx')^2 = \Sigma (y'^2) - b^2 \Sigma (x'^2)$, where $b$ is the re-
   gression coefficient appropriate to the par-
   ticular summation.

2. Split up the sums $\Sigma (y'^2)$, $\Sigma (x'^2)$, $\Sigma (x'y')$ into
   components appropriate to the problem and
   design of experiment, but

3. Instead of immediately computing $\Sigma (y' - bx')^2$
   on each line, combine the last line of the three
   columns of this preliminary analysis (error)
   with the components corresponding to each
   main effect in turn, and each interaction in
   turn, calculate $\Sigma (y' - bx')^2$ on each of the
   new lines formed, calculate this sum also
   on the error line, and fill in $\Sigma (y' - bx')^2$ on
   the original lines by

4. subtraction of the error value from the val-
   ue of this sum on the line representing the
   combined effect of error and the appropriate
   main effect or interaction.


In case of any doubt, for instance in the pres- 
ence of significant interactions, the mathemati- 
TABLE I


NUMBERS OF CHILDREN AVAILABLE IN VARIOUS GROUPS


School Filmslide Wallchart Blackboard School Total

Secondary Modern Boys 26 67
Selective Central Mixed 31 79
Girls Grammar 32 91
Boys Grammar 27 79
Secondary Technical Boys 28 84
Secondary Technical Girls 25 71

Method Total 471




TABLE II


PROPORTIONATE GROUP FREQUENCIES 





School 
School Filmslide Wallchart Blackboard Total 





18 
21 
24 
18 
21 
18 





120 
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cal analysis may always be used, but experience 
is often a very good guide. An example of the 
form of calculation will be given later. 

The method would, of course, only be used
in a case where a significant regression of y on
x is found. If significant differences are pres-
ent between class means, the regression found
from the total sums of squares and products is
apt to be misleading, and judgment of the useful-
ness of the method is usually based on the re-
gression coefficient calculated from the en-
tries on the error line. The presence of two or
more initial variates x significantly related to y
involves carrying columns in the analysis to con-
tain the sums of squares and products of the ad-
ditional variates, and involves working with par-
tial regression coefficients ($b_1, b_2, \ldots, b_p$). The
quantity subtracted from $\Sigma (y'^2)$ on any line to
give $\Sigma (y' - b_1 x'_1 - b_2 x'_2 - \ldots)^2$ is a generalised
form of $b^2 \Sigma (x'^2)$ (which may also be written
$b \Sigma (x'y')$), i.e., $b_1 \Sigma (x'_1 y') + b_2 \Sigma (x'_2 y') + \ldots + b_p \Sigma (x'_p y')$,
the relation to the general theory of re-
gression being immediately obvious.

As has already been stated, the dispropor-
tionate frequencies create a difficulty. Follow-
ing Snedecor’s suggestion (2) we may show that
there is no significant deviation of the frequen-
cies from proportionality ($\chi^2 = 2.79$, P = 99%).
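
A check of this kind can be sketched as an ordinary chi-square test on the schools-by-methods frequency table. The sketch below is an assumption about the form of the test, not a reproduction of Snedecor's worked procedure; the function name and the example counts are hypothetical, since the full cell frequencies of Table I are not reproduced here.

```python
def chi_square_proportionality(table):
    """Chi-square for departure of a rows-by-columns frequency table
    from proportionality. Returns (chi_square, degrees_of_freedom)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Frequency expected if the cell counts were exactly proportionate.
            expected = row_totals[i] * col_totals[j] / grand
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(col_totals) - 1)
    return chi2, df


if __name__ == "__main__":
    # Hypothetical counts: two schools (rows) by three methods (columns).
    counts = [[18, 19, 17], [21, 20, 22]]
    print(chi_square_proportionality(counts))
```

With six schools and three methods the statistic has 10 degrees of freedom, and with seven schools it has 12.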

To check a possible effect of disproportion-
ality, we propose to compare the results of an-
alysis in which this is disregarded with one in
which certain of the observations have been
dropped by random choice. The extent of the al-
teration in the experimental numbers is shown
by comparing Tables I and II.

This method of approach is unconventional. 
In a case as close to proportionality as that pre-
sented by the original data, instead of following 
the laborious procedure of recalculating totals 
corresponding to expected frequencies, as sug- 
gested by Snedecor (2), or the method of fitting 
constants suggested by Yates (3), and Stevens (4) 
illustrated in the British Journal of Psychology, 
Statistical Section (5), an analysis of the com-
plete set of data has been carried out as though
its frequencies were proportional, and the re- 
sults have been compared with those of the first 
analysis. Educational data frequently present 
very small departures from proportionality, and 
the writers feel that it is desirable to present ex- 
perimental evidence of this type on the errors 
likely to be committed on an assumption of pro- 
portionality. No generality is claimed for this 
empirical approach, though the writers are of 
the opinion that the extent of the departure of 
their data from proportionality is fairly typical 
of much educational data. 

The primary results obtained in the analysis 
of exactly proportional figures are shown within 
the heavily outlined rectangle in Table III. 

From the error line of this analysis we obtain
an estimate, .127, of the ‘‘within groups’’ cor-
relation between y and x, and with 382 d.f. this
is evidently significant compared with the P =
.05 value of .100 and almost reaches the P =
.01 value of .132 computed from the standard
error of zero correlation for this number of de-
grees of freedom.
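
The two quoted critical values can be checked with a short sketch. It assumes, as the quoted figures suggest, that the standard error of a zero correlation is taken as 1/√(d.f.) and that a two-sided normal deviate is used; the function name `critical_r` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def critical_r(alpha: float, df: int) -> float:
    """Two-sided critical value of r, assuming a standard error of 1/sqrt(df)."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # normal deviate for level alpha
    return z / sqrt(df)


if __name__ == "__main__":
    for alpha in (0.05, 0.01):
        print(alpha, round(critical_r(alpha, 382), 3))
```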

The analysis of $y^2$ with the effect of x elimin-
ated according to the method previously outlined
is shown in Table III, and the significance of the
results is recorded in Table VI. It should be
observed that, owing to the use of ‘‘b’’, a func-
tion of the observations, in computing the final
sum of squares, the number of degrees of free-
dom of the error term is one less than it would
have been in the analysis of the $y^2$ values alone.

Figures in the seventh column of Table III
are obtained by subtracting the figures in the
sixth column from those in the fourth column,
i.e., $\Sigma d^2 = \Sigma y'^2 - b^2 \Sigma x'^2$. The F values in
parentheses are obtained by using the interac-
tion variance instead of error variance against
which to estimate the main classification effects.
Any discrepancies in the figures in this and fol-
lowing tables are due to the fact that the figures
used in the original calculations have been round-
ed off to two decimal places for the purposes of
this article.

The analysis of the whole set of observations 
is shown in Table IV. 

The data originally collected in the experi- 
ment included results from a Secondary Modern 
Girls’ School. Owing to a technical hitch in 
this school the numbers being taught by the three 
methods were very disproportionate to those in 
the other schools. They were, respectively, 
18, 19 and 9 in methods I, II and III. For all
results including these $\chi^2 = 7.753$, P = 80%,
12 d.f., and we include the analysis of this data 
obtained with an even more unsuitable set of 
frequencies for comparison with the other two. 

Even with the extremely disparate numbers 
in the three experiments we obtain values of F 
for which the probabilities are very similar. 
Table VI illustrates this point. 

Finally, to support the findings regarding 
method differences, we include for interest a 
3 x 3 Latin Square analysis which was deliber- 
ately incorporated in three of the schools anal- 
ysed. This represents a small departure from 
the statistical requirement that the allocation of 
the methods in each school should be random, 
but it is not evident that such a collective ran- 
dom allocation to the three schools should make 
a serious difference to the results. It is per- 
haps more important to notice that these three 
schools were Grammar and Central Schools and 
that the results of analysing their results are 
thereby restricted to children of fairly high in- 
telligence. Since, however, these three schools 
were all divided into three streams, and one of 
TABLE VII


LATIN SQUARE RESULTS


Source of Variance          S.S.

Methods                    314.67
Intelligence (Streams)     108.22
Schools                     70.98
Error                     6569.12



TABLE VIII


VALUES OF ‘‘t’’


                            Methods    Signif-   Methods      Signif-   Methods     Signif-
                            I and II   icance    II and III   icance    I and III   icance

Proportionate Frequencies   1.21       n.s.      5.71         1%        4.41        1%


Disproportionate 
Frequencies (6 schools) . 56 Ss. . 19 4.97 1% 


Disproportionate 
Frequencies (7 schools) .01 
the main effects comparisons is ‘‘between streams’’,
while the ‘‘methods’’ comparisons to some ex- 
tent eliminate streams and therefore intelligence 
differences, the results of the analysis are of in- 
terest as they probably illustrate the increase 
in precision gained by using analysis of covari- 
ance which attempts to eliminate the effect of in- 
telligence more exactly. 

The significance values for F for two degrees 
of freedom are 3.0 at 5%, 4.6 at 1% and 6.9 at 
.1%. We see that the difference between methods
is not significant at the .1% level, which is not
an unexpected result as compared with those ob- 
tained by analysis of covariance. 

Since the experiment was undertaken with a
view to finding out whether the teacher should, 
for best immediate results, use one method or 
another, an investigation was made into the dif- 
ferences between the methods means taken two 
by two. The adjusted means for filmslide, wall-
chart and blackboard methods, respectively,
were 25.45, 24.62 and 28.27 for the proportion-
ate groups; 25.31, 24.97 and 28.24 for the dis-
proportionate groups (6 schools), and 24.91,
24.90 and 27.83 for the disproportionate groups
(7 schools).

The significant differences are all in favour 
of the blackboard presentation. We conclude, 
therefore, that on the evidence presented, 


1. Results of an analysis carried out by the 
method of analysis of covariance are not 
greatly affected by a moderate degree of 
disproportionality of the figures even 
though the method used is appropriate to 
strictly proportionate numbers. 


2. Blackboard illustration built up in the
course of a lesson by a reasonably com- 
petent illustrator gives superior immediate 
results to those obtained by illustrating 
with wallcharts or filmslides. 


It is of interest to note that the results of a 
delayed recall test presented one month after 
the lesson to all test groups confirmed the super- 
iority of the blackboard although the actual differ- 
ences between the methods means were smaller 
than in the immediate recall test. 
ESTIMATING COMPONENTS OF VARIATION
IN AN EXPERIMENTAL STUDY 
OF LEARNING 


WILLIAM HARRISON LUCOW 
Lord Selkirk School 
Winnipeg, Manitoba, Canada 


Introduction 


THE PURPOSE of the investigation 
in a self-contained experiment was to examine 
the variances arising from two approaches to 
the learning of introductory high school chemis- 
try, and to present the formulas and calcula- 
tions by which the components of variance and 
their fiducial and confidence limits might be es- 
timated. The differences in variance were test- 
ed for significance, and Model II analysis of 
variance was used to determine the components. 

The change in variance from pre-test to after-
test of the criterion examination was consider-
ed to be of greater import than the change in
means, under the assumption that greater var-
iance in a group indicated greater expression of
individual differences. Where a manufacturer
might wish to eliminate factors that contribute
to the variance of his product, the democratic
educationist might wish to avail himself of fac-
tors that contribute to the variance in performance
of his pupils. This experiment was a study in
variation and an example of the application of
techniques and formulas appropriate to the de-
termination of the components of variance in a
composite population.


Learning Methods 





The contrasting methods of learning chemis-
try were both ‘‘real’’ in the sense that both
could be found in operation generally in the high 
schools of Manitoba. One was a textbook-cent- 
ered approach and the other was a laboratory- 
centered approach. The distinction was one of 
emphasis rather than of abstraction. 


Population and Samples 





Two distinct populations were chosen, and 
the experiment was run separately for each. 
One population consisted of ‘‘accelerated’’ stud-
ents who followed a course designed for univer- 
sity matriculation; the other population consist- 
ed of ‘‘non-accelerated’’ students taking a course 
of not sufficient immediate credit for university 





entrance. 

The classes of 1952-53 were taken as sam-
ples and shown to be representative of the fore-
going populations. The sample of accelerated
students numbered thirty-six, eighteen of whom
followed the textbook-centered approach, and
eighteen the laboratory-centered approach. The
sample of non-accelerated students numbered
twenty-four, which made up a group of twelve
for each approach.


Design 


Each group was randomly divided into sub-
groups of three. Thus, two treatments (meth-
ods of learning) were administered to six repli-
cates (random sub-groups) each containing three
individuals. The measure of any individual, ex-
pressed in the form of a score, $X_{hij}$, was a
combination of the effects of treatment, repli-
cate, interaction of treatment and replicate, and
experimental error. Expressed in the form of
an equation,

$$X_{hij} = \mu + \alpha_h + \beta_i + (\alpha\beta)_{hi} + \epsilon_{hij}$$

where $\mu$ is the general mean; $\alpha_h$ is the effect
of the $h$th treatment; $\beta_i$ is the effect of the $i$th
replication; $(\alpha\beta)_{hi}$ is the effect of the $hi$th
interaction between treatment and replication; and
$\epsilon_{hij}$ is the effect of general experimental error.
In this study, the parameters of major interest
were not the foregoing variables, but their var-
iances:

$\sigma_t^2$, $\sigma_b^2$, $\sigma_{tb}^2$ and $\sigma_e^2$ respectively.

They are the components of variation.
For the accelerated pupils, $X_{hij}$ was any
score according to the following design:

Sub-      Textbook              Laboratory
group     Method                Method

I         X111  X112  X113      X211  X212  X213
II        X121  X122  X123      X221  X222  X223
III       X131  X132  X133      X231  X232  X233
IV        X141  X142  X143      X241  X242  X243
V         X151  X152  X153      X251  X252  X253
VI        X161  X162  X163      X261  X262  X263

The analysis of variance table for the acceler- 
ated pupils (six sub-groups, two treatments,
and three individuals in each sub-group) is giv- 
en on page 268. 

A similar design was developed for the non- 
accelerated pupils (four sub-groups, two treat- 
ments, and three individuals in each sub-group). 

The design was based on the randomized 
complete blocks model given by Anderson and 
Bancroft. (2) 

The E(MS) column in the Model II analysis of
variance table mentioned above represents the
expectation of the mean square in the population.
For the error variance, $\sigma_e^2$, the average value
of the mean square of the three measures with-
in each sub-group is taken. The expectation of
the mean square of interaction between the two
treatment means and six sub-group means is:

$$\sigma_e^2 + \frac{3 \times 2 \times 6}{2 \times 6}\,\sigma_{tb}^2 = \sigma_e^2 + 3\sigma_{tb}^2.$$

The expectation of the mean square of the two
treatments means is:

$$\sigma_e^2 + \frac{3 \times 2 \times 6}{2 \times 6}\,\sigma_{tb}^2 + \frac{3 \times 2 \times 6}{2}\,\sigma_t^2 = \sigma_e^2 + 3\sigma_{tb}^2 + 18\sigma_t^2.$$

The expectation of the mean square of the sub-
group means is:

$$\sigma_e^2 + \frac{3 \times 2 \times 6}{2 \times 6}\,\sigma_{tb}^2 + \frac{3 \times 2 \times 6}{6}\,\sigma_b^2 = \sigma_e^2 + 3\sigma_{tb}^2 + 6\sigma_b^2.$$

These calculations follow the rule enunciated
by Crump. (5)
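
The coefficients in these expectations can be reproduced with a small sketch. It assumes that the rule, as applied above, amounts to dividing the total number of observations by the number of levels of the component concerned; the function and argument names are hypothetical.

```python
def ems_coefficients(n_per_cell: int, n_treatments: int, n_blocks: int) -> dict:
    """Coefficient of each variance component in its expected mean square,
    taken as total observations divided by the number of levels."""
    total = n_per_cell * n_treatments * n_blocks
    return {
        "interaction": total // (n_treatments * n_blocks),
        "treatments": total // n_treatments,
        "blocks": total // n_blocks,
    }


if __name__ == "__main__":
    # Accelerated pupils: three individuals, two treatments, six sub-groups.
    print(ems_coefficients(3, 2, 6))
    # Non-accelerated pupils: three individuals, two treatments, four sub-groups.
    print(ems_coefficients(3, 2, 4))
```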


Statistics Relating to the Estimated Components 





Formulas for the standard error of estimate,
fiducial limits, and confidence intervals are pre-
sented with reference to Table I, a symbolic
table for use in computing component statistics.

The standard errors of estimate are given
on page 268.

The following formulas are given by Ander- 
son and Bancroft (2) who use the concepts of 
Fisher and Bross. The population fiducial lim- 
its for a variance component are: 


(Fo/F})- 1 52 
Fi (Fo/F)- 1 


(Fo/F2) - 1 
Fy (Fo/F») = J 








Od or< 
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The population confidence interval for a vari- 
ance component is: 


(Fo/F,) - 1 Aa 


(Fo/F9) - 1 
Fo-1 


Aw 
a*<¢ o* 
Fo -1 Cor 


In the foregoing formulas, $F_0$ is the ratio,
$V_i/V_4$, obtained from the data. $F_1$ is $F_{.95}$ with
$n_i$ and $n_4$ degrees of freedom, which equals
$1/F_{.05}$ with $n_4$ and $n_i$ degrees of freedom. (Note
the reversal in the order of degrees of freedom
when stating $F_{.95}$ in terms of $F_{.05}$. In order to
find the proper values in the F-table, the degrees
of freedom must be used in the sequence given.)
$F_2$ is $F_{.05}$ with $n_i$ and $n_4$ degrees of freedom.
$F'_1$ is $F_{.95}$ with $n_i$ and $\infty$ degrees of freedom.
$F'_2$ is $F_{.05}$ with $n_i$ and $\infty$ degrees of freedom.

It is not uncommon that the value of a compon- 
ent should turn out negative. When this occurs, 
the value is taken as zero. 


Practical Application of Model II (9) 





Model II follows exactly the same pattern as 
Model I in the analysis of variance table up to the
end of the mean square column. In Model II there 
is added the column listing expressions expected 
to be equal to the mean square in the population. 
Thus, if repeated samples had been analyzed to 
yield mean squares between methods, for in- 
stance, the average of these mean squares would 
be taken as the expectation of the mean square in 
the population. 

Table II shows the analysis of variance of the 
results on the criterion after-test by the acceler- 
ated groups. The unknowns in the expressions 
in the last column of the table consist of variances 
which are considered to be the components of var- 
iation in the population. Estimation of these com- 
ponents is accomplished by equating the mean 
squares to their expectations and solving for the 
unknown variances. 

Starting at the error line, the mean square of 
the error forms the estimate of the population 
error variance, 





$$\hat{\sigma}_e^2 = 395.1111.$$


The interaction variance estimate in the popu- 
lation is obtained by subtracting the error mean 
square from the interaction mean square and di- 
viding the difference by three. Thus, 


$$\hat{\sigma}_{tb}^2 = (659.7167 - 395.1111) \div 3 = 88.2019.$$


The treatment (methods of instruction) vari- 
ance estimate in the population is obtained by 
subtracting the interaction mean square from 
the treatment mean square and dividing the dif- 
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TABLE I 


SYMBOLIC TABLE FOR USE IN COMPUTING COMPONENT STATISTICS 





Source d/f 


Mean Square 


Expectation MS 





1 n, =(r 
2 Ny = (c 
3 Ng = (r 


ng = (k 


2 2 2 
a, + KO}, + cko), 
o? + kot, + rkof 

2 
o* + kor), 


0 





TABLE II


COMPONENT ANALYSIS OF VARIANCE TABLE FOR THE CRITERION
EXAMINATION AFTER-TEST OF ACCELERATED PUPILS


                Sum of         Mean
Source          Squares        Square       Expectation MS

Blocks           1934.1389     386.8278     $\sigma_e^2 + 3\sigma_{tb}^2 + 6\sigma_b^2$
Treatments          6.2500       6.2500     $\sigma_e^2 + 3\sigma_{tb}^2 + 18\sigma_t^2$
Interaction      3298.5833     659.7167     $\sigma_e^2 + 3\sigma_{tb}^2$
Error            9482.6667     395.1111     $\sigma_e^2$

Total           14721.6389
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Analysis of Variance Table 


Source da/f 
Sub-groups 5 


Methods 1 


E(MS) 
MSB o%, + Sof, + 6of, 


MST 08 + 30%, + 1807 


Ms (TB) 


MSE 


Standard Errors of Estimate 


2 
Soé ome 2 Sop 








é 2 
1 [v3 , 24 








Ng +2 Ng + 2 


rk 


ference by 18. Thus, 5f = (6.2500 - 659. 7167) 
# 18 = -36.3037. (This estimate is taken as 
zero. ) 

The blocks variance estimate in the popula- 
tion is obtained by subtracting interaction mean 
square from the blocks mean square and divid- 
ing the difference by six. Thus, 5, =(386. 8278 
~ 659. 7167) + 6 = -45.4815. Again, since the 
result is negative, the estimate of this variance 
component is taken as zero. 

Applying the formulas for standard error, 
fiducial limits, and confidence intervals to the 
interaction component, 04), = 88.2019, the fol- 
lowing values are obtained. Taking the compon- 
ent estimate as 88.20, the standard error is 
4.93, the fiducial limits are (0.00, 787.88), and 
the confidence interval is (0.00, 864. 47). 
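The component estimates can be checked with a few lines of arithmetic. The sketch below is only an illustration, not part of the original analysis: it takes the mean squares printed in Table II, and the helper name `component` is a hypothetical choice. Each component is obtained by equating two mean squares to their expectations, and a negative solution is reported as zero.

```python
# Mean squares taken from Table II (after-test, accelerated pupils).
MS_BLOCKS = 386.8278
MS_TREATMENTS = 6.2500
MS_INTERACTION = 659.7167
MS_ERROR = 395.1111


def component(ms_upper, ms_lower, coefficient):
    """Return (raw, reported) estimates of one variance component.

    raw is (ms_upper - ms_lower) / coefficient; reported replaces a
    negative raw value by zero, as the text prescribes.
    """
    raw = (ms_upper - ms_lower) / coefficient
    return raw, max(raw, 0.0)


# The error mean square estimates the error variance directly.
error_var = MS_ERROR
# Coefficients 3, 18 and 6 come from the Expectation MS column of Table II.
raw_tb, var_tb = component(MS_INTERACTION, MS_ERROR, 3)
raw_t, var_t = component(MS_TREATMENTS, MS_INTERACTION, 18)
raw_b, var_b = component(MS_BLOCKS, MS_INTERACTION, 6)

for name, raw, reported in [
    ("interaction", raw_tb, var_tb),
    ("treatments", raw_t, var_t),
    ("blocks", raw_b, var_b),
]:
    print(f"{name:12s} raw = {raw:9.4f}  reported = {reported:8.4f}")
```

The raw values agree with the figures in the text (88.2019, -36.3037 and -45.4815); only the interaction component survives the truncation at zero.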


The Use of Components of Variance 





The components of variance associated with 
the treatment effects (methods of instruction) in 
this study turned out to be not significantly dif- 
ferent from zero when compared with the error 
component. This situation suggests that the 
sources of experimental error be examined and 
reduced as far as practical. The principal 
sources of error magnitude may be in the de- 
sign itself, particularly in the size of sub-plots 
taken as replicates. It may be worthwhile to 
investigate the possibilities of a design that 
would make use of entire classrooms as sub- 
plots. 










A knowledge of components of variance may 
be used to reduce variation where variation is 
not wanted or to increase variation where vari- 
ation is wanted. These changes in variation 
may be effected by altering the conditions in- 
volved in the situation that produces variation. 
A knowledge of components may help to lend 
precision to the process of altering variation, 
and this knowledge may lead to insight into the 
operation of factors in learning situations both 
singly and in various combinations (interactions). 
Thus, new promise is added to educational ex- 
perimentation. 


The Measuring Instrument 


The criterion examination in chemistry was 
developed during a pilot study a full year pre- 
vious to the commencement of the experiment. 
This examination was valued at 216 marks di- 
vided equally among three objectives: (a) re- 
call of basic concepts, (b) application of con- 
cepts and principles, and (c) comprehension 
and interpretation. The type of item through- 
out was the multiple-response variety of the 
multiple-choice form, presented to the pupils 
as ‘‘the whole truth and nothing but the truth’’ 
type of item. The sample item read: 


Three times five is more than 
(a) 5 
(b) 10 
(c) 15 
(d) 20 
The pupils were instructed that (a,b) was the only acceptable response, and that any
combination from one to all choices might be the response called for in the items of the
examination. The examination was administered as a pre-test and as an after-test.


[Figure 1. Mean ± One Standard Deviation for All Groups on the Pre-Test and After-Test of
the Criterion Examination. Panels: accelerated pupils (textbook group, laboratory group)
and non-accelerated pupils (textbook group, laboratory group), each shown for the pre-test
and the after-test.]


Comparison of Pre-Test and After-Test 
Variances 





The t-test used to compare variances was:

$$t = \frac{(s_1^2 - s_2^2)\sqrt{n - 2}}{\sqrt{4(1 - r^2)\,s_1^2 s_2^2}}$$

The confidence interval for a coefficient of 0.95 is given by the expression:

$$\frac{s_1^2}{s_2^2}\left(K - \sqrt{K^2 - 1}\right) \le \frac{\sigma_1^2}{\sigma_2^2} \le \frac{s_1^2}{s_2^2}\left(K + \sqrt{K^2 - 1}\right)$$

where $K = 1 + \frac{2(1 - r^2)}{n - 2}\,t_0^2$ and $t_0$ is taken from the t-distribution with
n - 2 degrees of freedom by the requirement $P(|t| \le t_0) = 0.95$.
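The interval is what one gets by collecting every hypothesized variance ratio that the t-test would not reject, so the two formulas can be checked against each other. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the critical value is passed in by the caller, and the correlation r = 0.5 and sample size n = 30 in the usage lines are invented for illustration, since the paper does not report them. Only the two variances are taken from the text.

```python
import math


def variance_t(s1_sq, s2_sq, r, n):
    """t statistic for the difference between two correlated variances."""
    numerator = (s1_sq - s2_sq) * math.sqrt(n - 2)
    denominator = math.sqrt(4.0 * (1.0 - r * r) * s1_sq * s2_sq)
    return numerator / denominator


def variance_ratio_interval(s1_sq, s2_sq, r, n, t_crit):
    """Return (lower, ratio, upper) for the population ratio of variances."""
    ratio = s1_sq / s2_sq
    k = 1.0 + 2.0 * (1.0 - r * r) * t_crit ** 2 / (n - 2)
    half_width = math.sqrt(k * k - 1.0)
    return ratio * (k - half_width), ratio, ratio * (k + half_width)


# Variances of the accelerated textbook group; r, n and t_crit are assumed.
t_crit = 2.048  # two-sided 5% point of t with 28 degrees of freedom
t_value = variance_t(404.25, 113.28, r=0.5, n=30)
lower, ratio, upper = variance_ratio_interval(404.25, 113.28, 0.5, 30, t_crit)
print(f"t = {t_value:.3f}, ratio = {ratio:.4f}, interval = ({lower:.4f}, {upper:.4f})")
```

Because $(K - \sqrt{K^2 - 1})(K + \sqrt{K^2 - 1}) = 1$, the product of the two limits always equals the square of the observed ratio. The intervals printed in the text satisfy this: 2.1137 × 6.0248 is approximately 3.5686².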

The variance of the textbook group of accelerated pupils increased from 113.28 in the
pre-test to 404.25 in the after-test. According to the t-test, this increase was
significant at the one percent level. The ratio of the variances, 404.25/113.28, was
3.5686 with a 95% confidence interval of (2.1137, 6.0248). This meant that the ratio of
the after-test variance to the pre-test variance in the population could be as low as
2.1137 or as high as 6.0248. With a significance level of 0.05, all hypothesized ratios of
the variances smaller than 2.1137 or larger than 6.0248 would be rejected. Thus, as
indicated by the t-test, a ratio of one, that is, the hypothesis that the variances were
equal, would be rejected.

The variance of the laboratory group of accelerated pupils increased from 110.58 to 461.36,
an increase significant at the one percent level. The variance ratio was 4.1724 with a
confidence interval of (2.1246, 8.1944).

With the non-accelerated pupils the difference was marked. The variance of the textbook
group of non-accelerated pupils actually decreased from 150.27 in the pre-test to 140.81
in the after-test. The t-test showed this difference to be non-significant. The variance
ratio was 1.0671 with a confidence interval of (0.5063, 2.2489).

The variance of the laboratory group of non-accelerated pupils increased from 85.66 to
511.17, an increase significant at the one percent level. The variance ratio was 5.9675
with a confidence interval of (2.6097, 13.6450).

Figure 1 shows the results of the criterion 
examination in terms of means and standard de- 
viations. The increase in means from pre-test 
to after-test was in all cases highly significant 
(one-tenth of one percent level). With the accelerated pupils the increase in variance
was significant for both groups. With the non-accelerated
pupils, there was no significant change in var- 
iance with the textbook group, but a significant 
increase in variance for the laboratory group. 


Summary and Conclusions 





This paper has indicated the formulas and 
procedures that might be used in analyzing the 
variance in educational test results. The pro- 
cedure is that of Model II analysis of variance, 
suitable for the determination of variance com- 
ponents. 

The experimental results show that in the pop- 
ulation sampled of high school chemistry pupils, 
the brighter, or accelerated pupils, increase in 
variance as groups whether they use the textbook 
approach or the laboratory approach to the study 
of chemistry. 

The less bright, or non-accelerated, pupils, on the other hand, profit more from the
laboratory approach insofar as increase in variance as a group is concerned.

The writer recommends that the laboratory 
approach be used for all pupils. 

It is here suggested as a postulate in educational philosophy that great variation in
classroom achievement is evidence of the release of
individual differences among pupils during the 
learning process. This study has shown that one 
method of learning high school chemistry is more 
effective than another method in producing vari- 
ation in results (in the case of the non-acceler- 
ated pupils). If the foregoing philosophical posi- 
tion is accepted, then this investigation points 
to two lines of research. The first is the com- 
parison of methods of instruction in various sub- 
jects and at all school levels in order to discover 
which methods yield greater variation in results. 
The second line of research might be the re-ex- 
amination of studies that have been made, and 
that have had a suitable statistical design, for 
indications of methods that yield greater vari- 
ance. 
A PROCEDURE FOR ANALYZING A TEST AND
MAXIMIZING ITS RELIABILITY 


ANGUS G. MACLEAN and ARTHUR T. TAIT 
California Test Bureau 
Los Angeles, California 


THIS PRESENTATION of a procedure
for analyzing a test and maximizing its reliabil- 
ity is divided into three parts: (1) overview and 
computational procedure, (2) a fictitious example 
to illustrate the procedure, and (3) review of 
basic concepts. 

‘Reliability’ here refers to internal consis- 
tency or homogeneity. It is assumed that all it- 
ems are scored 1 or 0, and that the items are 
not differentially weighted. 

The technique produces in one operation all 
data necessary for: 


(1) Ascertaining item ‘difficulties’, so that 
items may be placed in order of difficulty, 
and the desired distribution of test scores 
(symmetric about the mean, negatively 
skewed, etc.) may be obtained by selecting
items on the basis of difficulty.

(2) Scrutinizing items on the basis of their
contributions to error of measurement and
to reliability, in order to reject those items
which contribute more to error than
to reliability. This procedure is more
precise than the use of item-test correlation
devices and appears to involve less,
not more, labor, since much of the information
is being obtained simultaneously
for (1) and/or for (3). However, point-biserial
correlations may be obtained on the
same work-sheet by carrying out one
additional operation.

(3) Computing the reliability (internal consis- 
tency) of the complete set of items. 

(4) Computing the reliability of any desired 
subset of items, including that of the best 
subset from the point of view of reliability. 

(5) Obtaining the mean, variance, and stand- 
ard deviation of the test. 

(6) Calculating the inter-item correlations 
(if the test is not found to be homogeneous) 
with a view to further subdivision of the 
proposed test into highly homogeneous sub- 
tests by reclustering the items. 


The description below is in terms of IBM equip- 
ment, but equivalent hand procedures may be de- 
veloped to set up the F-matrix, which is no more 





than a frequency table. 


1. Computational procedure 
a. Punch all item scores (1 or 0) for each
individual in the sample on IBM cards.
b. Use an IBM sorter, electronic statistical 
machine, or tabulator to generate an F- 
matrix, defined as follows: 





row 1 and column 1 refer to item number 
1 and the cell-entries consist of the total 
number of correct answers, or frequency 
(f) of ‘‘passes’’; 


in row 1, cell 1, the entry records the 

number passing on item 1 alone; in row 1 
cell 2 (as in column 1, cell 2) is entered 
the number of persons who gave correct 
answers to both items 1 and 2. 


Thus, entries on the principal diagonal
(top left to bottom right) give the ‘‘f-values’’
of the n items in the test. These may
be converted to p-values by dividing by N,
the total number of cases. ‘‘P’’ is a common
index of item difficulty, used to place
items in order of difficulty. Also, $(p - p^2)$
is the variance of the item, and $\sqrt{p - p^2}$
its standard deviation. 100p is the percent
passing, and is sometimes used in
reporting to avoid decimals.

The side entries (non-diagonal entries)
are analogous to cross-products in correlation.
Divided by N they give the proportion
of subjects who obtained correct answers
to the two items indicated by their
row and column number. The two triangles
formed by the non-diagonal entries
will be symmetrical, i.e., $f_{ij} = f_{ji}$.

Diagonal entries will be denoted $p_{ii}$ and
side entries $p_{ij}$, it being understood that
$i \ne j$.

c. Sum all the entries (f-values) in each row
and record each row total in a column denoted
as (a). These entries may be denoted
$\sum f_{ri}$.

d. Sum all the entries on the principal diagonal
and record their total. This sum is
denoted as $\sum f_{ii}$.

e. In a new column, denoted (b), enter the
$\sigma_{iT}$ or item-test covariance value for each
row. The computing formula is

$$\sigma_{iT} = \frac{1}{N}\left(\sum f_{ri} - \frac{f_{ii}}{N}\sum f_{ii}\right)$$

where $f_{ii}$ is the diagonal entry in row i
and $\sum f_{ri}$ is the row total for item i already
entered in column (a).

f. Sum column (b) and record the total. This
value is $\sigma_T^2$, the total variance or variance
of the whole test.

g. Obtain the corresponding $\sigma_i^2$ for each diagonal
entry, $f_{ii}$, by

$$\sigma_i^2 = p_{ii} - p_{ii}^2 = \frac{f_{ii}}{N} - \left(\frac{f_{ii}}{N}\right)^2$$

and record in column (c).

h. Subtract the entries (row by row) in column
(c) from those in column (b). Enter
in column (d). These entries constitute
$\sum_j \sigma_{ij}$, $i \ne j$, or the sum, for item i, of
its covariances with all other items.

i. Sum column (d) and record the total. This
gives us $\sigma_H^2$, or $\frac{n-1}{n}$ times the total
amount of ‘true’ variance present. As a
check, sum column (c). The latter total,
when added to that for column (d), should
equal the total for column (b), or

$$\sigma_T^2 = \sigma_H^2 + \sum \sigma_i^2$$

j. Record in the last column, (e), the differences
obtained by subtracting the entries
in column (c) from those in column (d),
i.e., obtain $S_i$, the selection index (for item i), by:

$$S_i = \sum_j \sigma_{ij} - \sigma_i^2 \quad (i \ne j).$$

k. All items contribute to both error and true
variance but those items whose S-value is
negative detract from the reliability of the
test. The error component is always positive,
since it is the item’s variance, but
the item itself may correlate positively or
negatively with the other items. The
inter-correlations of the items, or, more
precisely, their covariances, are the sole
source of true variance or reliability. If
an item’s variance $(p - p^2)$ exceeds the
sum of its covariances, $\sum_j (p_{ij} - p_{ii}p_{jj})$,
$i \ne j$, with all other items in the test,
then that item is actually lowering the
reliability of the test. Indeed, some items
have a negative covariance total. Many
tests in actual use do contain such items.
As an example of how drastic this effect
may be, one proposed test of 25 apparently
homogeneous items, analyzed by this
method, proved to contain 17 items whose
S-index was negative. The reliability
estimated by the Kuder-Richardson formula
20 was .297. When these 17 items were
struck out the remaining eight items
possessed a reliability of .550! The total
variance was only halved, while the true
variance actually increased because of the
rejection, among items whose covariance-totals
were smaller than their variances,
of some items which actually had negative
totals.


l. Having reached a final decision as to
which items are to be retained, new
totals for columns (b) and (d), or total
variance and true variance, are obtained,
using only the chosen items. Then
the reliability coefficient, $r_{tt}$, based
on these items, is computed:

$$r_{tt} = \frac{n}{n - 1}\cdot\frac{\sigma_H^2}{\sigma_T^2}$$

where n = the number of items.


Two qualifications are in order. First, the
above formula, known as Kuder-Richardson
formula 20, gives the lower
limit of the true reliability, rather than
the reliability itself. It is nevertheless
the most widely used estimator of internal
consistency. Secondly, the definition
of reliability is

$$r_{tt} = \frac{\text{true variance}}{\text{total variance}}$$

yet the K-R formula requires the correction
factor $\frac{n}{n-1}$, implying that the
total for column (d) underestimates the
true variance. However, reference
has been made to it as ‘true variance’,
for the sake of simplicity and lucidity.
If Horst’s (3) recent formula is preferred
it is a simple matter to rank
the item-difficulties (p’s) in descending
order of magnitude, to multiply each p
by its corresponding rank-number, and
to sum the products to obtain $\sum i\,p_{ii}$. The
formula then is:

$$r_{tt} = \frac{\sigma_T^2 - \sum\sigma_i^2}{\sigma_{Tm}^2 - \sum\sigma_i^2}\cdot\frac{\sigma_{Tm}^2}{\sigma_T^2}$$

where

$$\sigma_{Tm}^2 = 2\sum i\,p_{ii} - \sum p_{ii}\left(1 + \sum p_{ii}\right)$$


To sum up: 


Column (b) provides the item-test
covariances, and their sum provides
the variance of the whole test, $\sigma_T^2$.

Column (c) records the item variances,
which can never exceed .2500, but
which, for optimal discrimination,
should approach .2500. As a rule an
item variance should not fall below
.1000, because then the item would be
so easy or so hard that it would not
be very useful. An exception could
be made, of course, for the first few
and last few items.

Column (d) records the inter-item
covariance-sums, and its total is $\sigma_H^2$,
which, when multiplied by $\frac{n}{n-1}$,
estimates the true variance of the test.

Column (e) lists the S-indices which
identify the items which are detracting
from reliability, also those which
are contributing little, and those which
are the most desirable.

Once the items are chosen, computations
may be made for the mean,
variance, standard deviation, reliability
and standard error of measurement
of the test. If it is desired to
record the item-test correlations, the
following formula may be used:

$$r_{iT} = \frac{\sigma_{iT}}{\sqrt{\sigma_i^2\,\sigma_T^2}}$$

This is a point-biserial coefficient.
($\sigma_{iT}$ is the entry in column b for item
i.)

The mean of the test is equal to $\sum p_{ii}$,
or, since the sum of the f-entries on
the principal diagonal is already recorded,
the mean = $\frac{\sum f_{ii}}{N}$.

The item ‘‘difficulties’’ are simply 
obtained by converting the entries on 
the principal diagonal to proportions. 

If for any reason inter-item correlations
are desired,

$$r_{ij} = \frac{p_{ij} - p_{ii}p_{jj}}{\sqrt{(p_{ii} - p_{ii}^2)(p_{jj} - p_{jj}^2)}}$$

where $p_{ii}$ is the proportion of subjects
giving the correct response to item i,
$p_{jj}$ the proportion correct on item j,
and $p_{ij}$ the proportion correct on both.
This formula results in a phi-coefficient.
If the overall homogeneity proves
to be low, these item-intercorrelations
may be used to make a loose cluster
analysis, in order to break down the
test into homogeneous subtests. A
formal factor analysis would usually
involve too much labor for this purpose,
and phi-coefficients are only
roughly comparable for items of varying
difficulty.¹ It is advisable to carry
four places of decimals throughout.





2. Example 


A fictitious five-item test yields the following information when the results of a try-out on 100 cases 
are processed: 


F-matrix and Analysis

Item     1    2    3    4    5     (a)      (b)      (c)      (d)      (e)
1       80   66   59   31   23     259    .4620    .1600    .3020    .1420
2       66   68   37   26   10     207    .2612    .2176    .0436   -.1740
3       59   37   61   30   24     211    .4874    .2379    .2495    .0116
4       31   26   30   32   24     143    .5788    .2176    .3612    .1436
5       23   10   24   24   25     106    .3950    .1875    .2075    .0200

$\sum f_{ii}$ = 266        Totals:   926   2.1844   1.0206   1.1638    .1432

$M_t = \sum p_{ii} = 2.66$        Checks ✓ ✓


1. If it is decided to perform such a re-clustering, it is advisable to correct the inter-item φ’s
by dividing each by its corresponding $\phi_{max}$. See Guilford (1).


Computations for item 1

Column b: $\sigma_{iT} = \frac{1}{N}\left(\sum f_{ri} - p_{ii}\sum f_{ii}\right) = \frac{1}{100}(259 - .80 \times 266) = .4620.$

Check on total: $\sigma_T^2 = \frac{1}{100}(926 - 2.66 \times 266) = 2.1844.$

Column c: $\sigma_i^2 = .80 - .80^2 = .1600.$

Column d: $\sum_j \sigma_{ij}$ (covariance total for item 1) = .4620 - .1600 = .3020.

Check on total: $\sigma_H^2 + \sum\sigma_i^2 = \sigma_T^2.$

Column e: $S_i = .3020 - .1600 = .1420.$

Check on total: $\sum S_i + 2\sum\sigma_i^2 = \sigma_T^2.$
Item-test correlation for item 1, a point-biserial r, = item-test covariance divided by
$\sqrt{\text{item variance} \times \text{test variance}}$:

$$r_{iT} = \frac{.4620}{\sqrt{.1600 \times 2.1844}} = .781.$$


Covariance between items 1 and 3 = $p_{13} - p_{11}p_{33} = .59 - .80 \times .61 = .1020.$

Correlation between items 1 and 3, a phi coefficient, = $\frac{.1020}{\sqrt{.1600 \times .2379}} = .523.$
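The work-sheet columns follow mechanically from the F-matrix, so the whole analysis can be sketched in a short script. This is an illustration only, not the authors' IBM procedure: the function names are hypothetical, and the first column of the matrix is filled in from the first row by the symmetry $f_{ij} = f_{ji}$.

```python
import math

N = 100  # number of cases in the try-out
F = [    # F-matrix of the fictitious five-item test
    [80, 66, 59, 31, 23],
    [66, 68, 37, 26, 10],
    [59, 37, 61, 30, 24],
    [31, 26, 30, 32, 24],
    [23, 10, 24, 24, 25],
]


def analyze(f_matrix, n_cases):
    """Return one (a, b, c, d, e) tuple per item, as on the work-sheet."""
    size = len(f_matrix)
    diagonal_total = sum(f_matrix[i][i] for i in range(size))
    rows = []
    for i in range(size):
        p = f_matrix[i][i] / n_cases
        a = sum(f_matrix[i])                     # (a) row total
        b = (a - p * diagonal_total) / n_cases   # (b) item-test covariance
        c = p - p * p                            # (c) item variance
        d = b - c                                # (d) covariance total
        e = d - c                                # (e) selection index S
        rows.append((a, b, c, d, e))
    return rows


def phi(f_matrix, n_cases, i, j):
    """Phi coefficient between items i and j (0-based indices)."""
    p_i = f_matrix[i][i] / n_cases
    p_j = f_matrix[j][j] / n_cases
    covariance = f_matrix[i][j] / n_cases - p_i * p_j
    return covariance / math.sqrt((p_i - p_i ** 2) * (p_j - p_j ** 2))


rows = analyze(F, N)
totals = [sum(column) for column in zip(*rows)]
for item, row in enumerate(rows, start=1):
    print(item, row[0], *(f"{value:.4f}" for value in row[1:]))
print("totals", totals[0], *(f"{value:.4f}" for value in totals[1:]))
print("phi(1, 3) =", round(phi(F, N, 0, 2), 3))
```

The column totals reproduce the check values of the example (926, 2.1844, 1.0206, 1.1638, .1432), and item 2 is the only item with a negative selection index.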





Computations for the test as a whole

Mean = $\sum p_{ii} = \frac{\sum f_{ii}}{N} = 2.66.$

Variance = $\sigma_T^2 = 2.1844.$

Standard Deviation = $\sqrt{2.1844} = 1.48.$

Order of difficulty: correct as it is.

Reliability (K-R 20): $r_{tt} = \frac{5}{4}\cdot\frac{1.1638}{2.1844} = .666.$

(Horst): $\sum i\,p_{ii} = (1 \times .80) + (2 \times .68) \ldots$ etc., = 6.52.

$\sigma_{Tm}^2 = 13.04 - (2.66)(3.66) = 3.3044.$

$r_{tt} = \frac{1.1638}{3.3044 - 1.0206}\cdot\frac{3.3044}{2.1844} = .771.$
Item selection

On the basis of the S-indices, item 2 might
be rejected. The new total for column a is:
193 + 174 + 117 + 96 = 580.

The new $\sum f_{ii}$ is: 266 - 68 = 198.
The new $\sigma_T^2$ = 5.80 - (1.98)² = 1.8796.
The new $\sum\sigma_i^2$ = 1.0206 - .2176 = .8030.

Then the reliability of the remaining items
is, by the Kuder-Richardson formula:

$$r_{tt} = \frac{4}{3}\cdot\frac{1.8796 - .8030}{1.8796} = .764.$$

Omission of item 2, which has the negative
S-index, has raised the K-R reliability by about
.10, even though it has dropped the total variance.

By the Horst formula, when $\sum i\,p_{ii} = 3.98$ and
$\sigma_{Tm}^2 = 2.0596$,

$$r_{tt} = \frac{1.0766}{2.0596 - .8030}\cdot\frac{2.0596}{1.8796} = .939.$$





This value represents a considerable in- 
crease in reliability over the value of .771 ob- 
tained prior to the deletion of item 2. 

As is well known, the K-R formula 20 underestimates
the true reliability when certain
assumptions are violated, such as the assumption
that the item difficulties are equal. Horst’s 
formula yields a coefficient corrected for the 
attenuation due to variation of the difficulties. 
It is seen in the above example how great this 
attenuation may be, even in 4 items whose dif- 
ficulties lie between . 80 and .20, i.e., whose 
difficulties are by no means extreme. 
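Both reliability coefficients depend only on the item difficulties, the total variance and the sum of the item variances, so the comparison before and after dropping item 2 is easy to recompute. The sketch below is illustrative; the function names `kr20` and `horst` are hypothetical, and the input figures are those of the worked example.

```python
def kr20(n_items, total_variance, sum_item_variances):
    """Kuder-Richardson formula 20."""
    true_part = total_variance - sum_item_variances
    return n_items / (n_items - 1) * true_part / total_variance


def horst(p_values, total_variance, sum_item_variances):
    """Horst's coefficient, corrected for dispersion of item difficulties."""
    ranked = sorted(p_values, reverse=True)          # easiest item gets rank 1
    weighted = sum(rank * p for rank, p in enumerate(ranked, start=1))
    mean = sum(ranked)
    max_variance = 2 * weighted - mean * (1 + mean)  # maximum possible variance
    true_part = total_variance - sum_item_variances
    return (true_part / (max_variance - sum_item_variances)
            * max_variance / total_variance)


all_items = [0.80, 0.68, 0.61, 0.32, 0.25]
without_item_2 = [0.80, 0.61, 0.32, 0.25]

print("K-R 20, five items:", round(kr20(5, 2.1844, 1.0206), 3))
print("Horst,  five items:", round(horst(all_items, 2.1844, 1.0206), 3))
print("K-R 20, four items:", round(kr20(4, 1.8796, 0.8030), 3))
print("Horst,  four items:", round(horst(without_item_2, 1.8796, 0.8030), 3))
```

Sorting inside `horst` means the difficulties may be supplied in any order; the ranks are assigned after sorting, as the procedure requires.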


3. Review of basic concepts 





The computational formulas given above 
were derived to facilitate working from the raw 
F-matrix, rather than from the item variance- 
covariance matrix into which it can be trans- 
formed. 

This latter is derived from F as follows: the
item variances will lie along the principal diagonal,
and will be derived from the corresponding
frequencies by:

$$\sigma_i^2 = p_{ii} - p_{ii}^2 = \frac{f_{ii}}{N} - \left(\frac{f_{ii}}{N}\right)^2$$

The side entries will be the inter-item covariances,

$$\sigma_{ij} = p_{ij} - p_{ii}p_{jj} = \frac{f_{ij}}{N} - \frac{f_{ii}\,f_{jj}}{N^2}$$

Then the following statements are true: 

The sum of the entries in row i, including 
the diagonal entry, is the covariance of item i 
with the total test. 2 

The sum of these sums, or the sum of all 
entries in the matrix, is the variance of the 
total test. 2 

The sum of the diagonal entries is the term
$\sum\sigma_i^2$, often written $\sum pq$, the sum of the item
variances.

The difference between the last two terms,
$\sigma_T^2 - \sum\sigma_i^2$, is $\sigma_H^2$, or the sum of all the
inter-item covariances in the matrix. It is therefore
the ‘‘amount of homogeneity’’ present.

The sum of the entries in row i, excluding
the diagonal entry, is the element that has been
denoted as $\sum_j \sigma_{ij}$, and the sum of these sums is
$\sigma_H^2$. $\sum_j \sigma_{ij}$ has been referred to as the
covariance-total for item i, and, if this is larger
than the variance of item i, then i is evidently 
contributing more to the reliability of the total 
test than it is to error variance. If the reverse 
is true, then removal of the item will result in 
an increase in the reliability. 
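These statements can be verified on the example by converting the F-matrix to the variance-covariance matrix explicitly. The sketch is illustrative only; variable names are hypothetical, and the F-matrix is that of the fictitious five-item test with its first column restored by symmetry.

```python
N = 100
F = [
    [80, 66, 59, 31, 23],
    [66, 68, 37, 26, 10],
    [59, 37, 61, 30, 24],
    [31, 26, 30, 32, 24],
    [23, 10, 24, 24, 25],
]
size = len(F)
p = [F[i][i] / N for i in range(size)]

# Variance-covariance matrix: p_ii - p_ii^2 on the diagonal,
# p_ij - p_ii * p_jj elsewhere.
cov = [
    [p[i] - p[i] ** 2 if i == j else F[i][j] / N - p[i] * p[j]
     for j in range(size)]
    for i in range(size)
]

item_test_cov = [sum(row) for row in cov]            # row sums, diagonal included
total_variance = sum(item_test_cov)                  # sum of all entries
sum_item_variances = sum(cov[i][i] for i in range(size))
homogeneity = total_variance - sum_item_variances    # all inter-item covariances
covariance_totals = [item_test_cov[i] - cov[i][i] for i in range(size)]

print([round(value, 4) for value in item_test_cov])
print(round(total_variance, 4), round(sum_item_variances, 4), round(homogeneity, 4))
```

The row sums equal column (b) of the work-sheet and the off-diagonal row sums equal column (d), which is why the short-cut formulas applied to raw frequencies give the same results without building this matrix.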

The inter-item and item-test covariances 
may be converted into phi and point-biserial co- 
efficients respectively, by dividing them by the 
geometric mean of the relevant variances, since
a covariance coefficient is equal to the corres- 
ponding correlation coefficient multiplied by the 
product of the standard deviations of the two 
variables concerned. The necessary variances 
are already available. 

Thus, the procedure, suggested above, of 
summing the raw frequencies across the rows 
and along the principal diagonal, and using mod- 
ified formulas, eliminates the tedious process 
of converting every element in the F-matrix in- 
to a variance or covariance term. In addition, 
the method as a whole is an especially power- 
ful and precise technique for analyzing an exist- 
ing test or for improving a test under develop- 
ment or revision. Also, it is highly desirable 
to have all data concerning a test computed in 
one operation and recorded on the same sheet, 
preferably on the IBM data sheet. Where this
is done, all data concerning the test are mutually
consistent, and, if the initial punching is
verified, the basic data are completely reliable
for all the different computations, which are
too often done at different times by different
workers, with minor discrepancies and much
duplication of computation.


2. These two statements are based on findings reported by Gulliksen (2).


1. ... McGraw-Hill Book Co., 1950), p. 343.

2. Gulliksen, Harold. Theory of Mental Tests
(New York: John Wiley and Sons, 1950),
pp. 376f.
THE LEAST-SQUARES ANALYSIS OF A p×q×r
FACTORIAL DESIGN WITH UNEQUAL
SUBCLASS FREQUENCIES*


RAYMOND O. COLLIER, Jr. 
University of Minnesota 


Introduction 


IN MANY instances experimenters are
plagued with the problem of analyzing designs
which contain unequal subclass frequencies.
Unfortunately, in the analysis of such designs
having two or more factors, non-orthogonality
of the effects makes it impossible to estimate
them directly as is possible in the orthogonal
case of equal subclass frequencies.

One corrective practice consists of a random
withdrawal of cases so that orthogonality of
effects is obtained. However, there are situations
in which this procedure results in a serious
decrease in the error degrees of freedom
of the design such that analysis is inadvisable.
(This is particularly true if the power of the
tests of significance is considered.) Furthermore,
for many investigators this method of
subclass equalization is equivalent to discarding
information.

The two-factor design involving unequal
subclass frequencies has been explicitly treated
by Anderson and Bancroft (1) and Kempthorne
(3). Therefore, the purpose of this paper will
be, first, to give some general explanations of
the least-squares solutions of a three-factor
design and, thereafter, to present an actual
applied problem showing the analysis of such a
design.


General Remarks 





For the three-factor (p×q×r factorial) design,
consider the following model:

(1) $Y_{ijkl} = \mu + \alpha_i + \beta_j + \gamma_k + \eta_{ij} + \psi_{ik} + \omega_{jk} + \pi_{ijk} + \epsilon_{ijkl}$

where i = 1, 2, ..., p; j = 1, 2, ..., q; k = 1, 2, ..., r;
l = 1, 2, ..., $n_{ijk}$; $\mu$ is the fixed general effect;
$\alpha$, $\beta$, $\gamma$ are the fixed main effects;
and $\eta$, $\psi$, $\omega$ and $\pi$ are fixed interaction effects.
The $\epsilon_{ijkl}$ are assumed to be normally and
independently distributed with mean 0 and variance
$\sigma^2$. Hypotheses of interest are $H_1: \alpha_i = 0$,
$H_2: \beta_j = 0$, $H_3: \gamma_k = 0$, $H_4: \eta_{ij} = 0$, $H_5: \psi_{ik} = 0$,
$H_6: \omega_{jk} = 0$, $H_7: \pi_{ijk} = 0$.

Notice that the model (1) above includes a 
total of pr + pq + rq + pqr interaction parameters.
Even in the simplest case possible, i.e.,
when p=q=r = 2, there are 20 interaction 
constants to be estimated—an imposing task. 

Instead, assume that the interaction param- 
eters are zero and rewrite the model as: 


(2) $Y_{ijkl} = \mu + \alpha_i + \beta_j + \gamma_k + \epsilon_{ijkl}$


and obtain the p + q + r + 1 estimates for $\mu$, $\alpha_i$,
$\beta_j$ and $\gamma_k$. Furthermore, we will later use the
complete model


(3) $Y_{ijkl} = \mu + D_{ijk} + \epsilon_{ijkl}$


i.e., one which attributes to each ijk-th subclass
a fixed effect. This provides the determination
of the sum of squares due to all interactions,
in other words a ‘‘pooled interaction’’ sum of
squares.

To estimate $\mu$, $\alpha_i$, $\beta_j$, and $\gamma_k$, let us employ
normal regression theory. The normal (least
squares) equations are obtained by minimizing
$\sum_{ijkl} \epsilon_{ijkl}^2$ in (2) with respect to $\mu$, $\alpha_i$, $\beta_j$ and
$\gamma_k$. Thus, the normal equations can be written
as functions of $m$, $a_i$, $b_j$, and $c_k$, the estimates
of $\mu$, $\alpha_i$, $\beta_j$, and $\gamma_k$, respectively:


(4)
$m$: $n\,m + \sum_i n_{i..}a_i + \sum_j n_{.j.}b_j + \sum_k n_{..k}c_k = Y_{....}$

$a_i$: $n_{i..}m + n_{i..}a_i + \sum_j n_{ij.}b_j + \sum_k n_{i.k}c_k = Y_{i...}$

$b_j$: $n_{.j.}m + \sum_i n_{ij.}a_i + n_{.j.}b_j + \sum_k n_{.jk}c_k = Y_{.j..}$

$c_k$: $n_{..k}m + \sum_i n_{i.k}a_i + \sum_j n_{.jk}b_j + n_{..k}c_k = Y_{..k.}$


where $n = n_{...}$ and the subscript ‘‘dot’’ on any
term signifies a summation over that subscript,
e.g., $n_{i..} = \sum_{jk} n_{ijk}$.


*The author is particularly thankful to Dr. Cyril J. Hoyt, not only for providing the opportunity
for considering this problem, but also for continued encouragement. Data used in the example
constitute a portion of data collected in a study conducted by Dr. Raymond G. Price and Dr. Cyril
J. Hoyt at the Bureau of Educational Research, University of Minnesota.


The equations of (4), however, are not linearly
independent. Therefore, to solve uniquely
for these estimates by any calculation method,
it is necessary to impose three linearly
independent restrictions on the parameters.
Except for being independent these are quite
arbitrary. Notice, however, that if the restrictions
for (2)


(5) $\sum_i n_{i..}\alpha_i = 0,\quad \sum_j n_{.j.}\beta_j = 0,\quad \sum_k n_{..k}\gamma_k = 0$


are used we have $E(Y_{....}/n) = \mu$.


Also in (4), if the restrictions made on $a_i$, $b_j$,
and $c_k$ are


(6) $\sum_i n_{i..}a_i = 0,\quad \sum_j n_{.j.}b_j = 0,\quad \sum_k n_{..k}c_k = 0$


then $m = Y_{....}/n$, the general mean, a result
consistent with the case in which the subclass
frequencies are equal.

Now $m$, $a_i$, $b_j$, and $c_k$ may be determined by
one of the many methods for solving simultaneous
linear equations. Suppose that we have
solved for these estimates. With the present
setup we will not be able to test the previously
mentioned hypotheses, $H_4$, $H_5$, $H_6$, $H_7$. Instead
it will be possible to test $H'_4: \eta_{ij} = \psi_{ik} = \omega_{jk} = \pi_{ijk} = 0$.
We know that the likelihood ratio test of
this hypothesis involves finding the error sum of
squares and the sum of squares attributable to
this hypothesis. The error sum of squares for
this model is the ‘‘within subclasses’’ sum of
squares, i.e.,


(7) $SSE = \sum_{ijkl} Y_{ijkl}^2 - \sum_{ijk} Y_{ijk.}^2 / n_{ijk}$


From (3) we note that this third model is
equivalent to that of the one-way analysis of variance
and that the sum of squares due to the hypothesis
$D_{ijk} = 0$ is the ‘‘between subclasses’’
sum of squares (SSB) given by


(8) $SSB = \sum_{ijk} Y_{ijk.}^2 / n_{ijk} - Y_{....}^2 / n$


It follows that the total sum of squares (SST)
around the general mean, $Y_{....}/n$, is given as


SST = SSE + SSB.
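The decomposition in (7) and (8) holds whatever the subclass frequencies are, which a small numerical check makes concrete. The sketch below is an illustration under stated assumptions: the scores are invented, the 2×2×2 layout and all variable names are hypothetical, and the cell sizes are deliberately unequal.

```python
# Hypothetical scores keyed by subclass (i, j, k); the n_ijk are unequal.
cells = {
    (1, 1, 1): [52.0, 47.0, 55.0],
    (1, 1, 2): [49.0, 44.0],
    (1, 2, 1): [61.0, 58.0, 63.0, 57.0],
    (1, 2, 2): [50.0],
    (2, 1, 1): [45.0, 48.0],
    (2, 1, 2): [53.0, 51.0, 56.0],
    (2, 2, 1): [60.0, 62.0],
    (2, 2, 2): [46.0, 43.0, 47.0, 49.0],
}

n = sum(len(scores) for scores in cells.values())
grand_total = sum(sum(scores) for scores in cells.values())            # Y....
raw_sum_sq = sum(y * y for scores in cells.values() for y in scores)   # sum of Y_ijkl^2
cell_term = sum(sum(scores) ** 2 / len(scores) for scores in cells.values())

sse = raw_sum_sq - cell_term              # equation (7), within subclasses
ssb = cell_term - grand_total ** 2 / n    # equation (8), between subclasses
sst = raw_sum_sq - grand_total ** 2 / n   # total about the general mean

# Direct computation of the within-subclass sum of squares, as a cross-check.
within = sum(
    (y - sum(scores) / len(scores)) ** 2
    for scores in cells.values()
    for y in scores
)

print(f"n = {n}, SSE = {sse:.4f}, SSB = {ssb:.4f}, SST = {sst:.4f}")
```

With pqr = 8 subclasses and n = 21 observations, SSE here carries n - pqr = 13 degrees of freedom.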

From least-squares regression theory, the
sum of squares of the estimating constants, $m$,
$a_i$, $b_j$, and $c_k$ is


(9) $SS(m,a,b,c) = mY_{....} + \sum_i a_iY_{i...} + \sum_j b_jY_{.j..} + \sum_k c_kY_{..k.}$





(Vol. XXII 


Therefore, the reduction in the sum of
squares due to the hypothesis $H'_4$ is


(10) $SSH'_4 = SSB - SS(m,a,b,c) + SS(m) = \sum_{ijk} Y_{ijk.}^2 / n_{ijk} - SS(m,a,b,c).$


Now to obtain the sum of squares due to $H_1$:
$\alpha_i = 0$, set the $a_i$ equal to zero in the normal
equations, delete the a-equations, and solve for
$m$, $b_1$, $b_2$, $b_3$, $b_4$, $c_1$, $c_2$, $c_3$, the constants
under the hypothesis.


From (9), $SS(m,b,c) = mY_{....} + \sum_j b_jY_{.j..} + \sum_k c_kY_{..k.}$

Consequently,

(11) $SSH_1 = SS(m,a,b,c) - SS(m,b,c).$


Similarly, by setting $b_j = 0$ and then $c_k = 0$ in
the normal equations, and proceeding as before,
we find


(12) $SSH_2 = SS(m,a,b,c) - SS(m,a,c)$
(13) $SSH_3 = SS(m,a,b,c) - SS(m,a,b).$


It is worthwhile to state at this point that if
a calculation method known as pivotal condensation
(see reference 2) is utilized, certain
relationships are made obvious which may be
applied as a method for finding $SSH_1$, $SSH_2$, $SSH_3$.
In this method, various constants are
‘‘swept out,’’ i.e., deleted by adding certain
multiples of equations. Thus, if in our normal
equations both the $a_i$ and $b_j$ constants are ‘‘swept
out’’ from the $c_k$ equations, these remaining
c-equations will be called the adjusted c (adjusted
for a and b) equations. A sum of squares due
to the adjusted c’s may be calculated which will
be called the sum of squares of c adjusted for
a and b, or $SS(c \mid a,b)$. Similarly, $SS(c \mid b)$ will
refer to the sum of squares of c adjusted for b
alone, etc. In general it may be shown that


(14) $SS(c \mid a,b) + SS(a \mid b) + SS(b) + SS(m) = SS(m,a,b,c)$

(15) $SS(a \mid b,c) + SS(c \mid b) + SS(b) + SS(m) = SS(m,a,b,c)$

(16) $SS(b \mid a,c) + SS(a \mid c) + SS(c) + SS(m) = SS(m,a,b,c)$


and also that


(17) $SSH_1 = SS(a \mid b,c)$
(18) $SSH_2 = SS(b \mid a,c)$
(19) $SSH_3 = SS(c \mid a,b).$


It is clear then that equations (11), (12), (13)
or (17), (18), (19) may be used to obtain the
$SSH_1$, $SSH_2$, and $SSH_3$.


March, 1954) COLLIER


TABLE I
ANALYSIS OF VARIANCE

Source of Variation                  Degrees of Freedom     Sum of Squares   Mean Square
A ($H_1$: $\alpha_i = 0$)            p - 1                  $SSH_1$
B ($H_2$: $\beta_j = 0$)             q - 1                  $SSH_2$
C ($H_3$: $\gamma_k = 0$)            r - 1                  $SSH_3$
I: Pooled Interaction                pqr - p - q - r + 2    $SSH'_4$
   ($H'_4$: $\eta_{ij} = \psi_{ik} = \omega_{jk} = \pi_{ijk} = 0$)
Error                                n - pqr                SSE


TABLE II
ANALYSIS OF VARIANCE FOR EXAMPLE

Source of Variation    d.f.   SS               MS              Decision
A                      1      1,662,016.17     1,662,016.17    Accept
B                      3      4,229,771.40     1,409,923.80    Reject
C                      2        863,471.26       431,735.63    Accept
Error                  72    31,756,570.43       441,063.48
Pooled Interaction     17     8,480,332.79       498,843.11    Accept

The analysis of variance may now be written
as shown in Table I. Because the interaction
parameters were assumed to be zero in the
model (2), there exist tests for H_1, H_2, H_3 only
if H'_4 is accepted.


An Example 


The data used in the following example were 
collected in a typing-reading relationships study. 
The three factors included were:


A: 2 levels of order of presentation of 
typing and reading materials 

B: 4 levels of high school typing classes 

C: 3 levels of typing tests 


The basic variable Y_{ijkl} was a typing speed
score for the lth individual of the ith level of
A, the jth level of B and the kth level of C.
From (4) we write the normal equations as shown
at the top of the next page.

If we sweep out the b's in the m, a, and c
equations and impose the restrictions shown on
the next page, we can first solve for the c's, and
then by "back" substituting, solve for m, a_1,
a_2, b_1, b_2, b_3, and b_4. This has been done,
yielding as solutions:


m = 4463.072917, a_1 = 132.817457, a_2 = -132.817457,
b_1 = -284.950488, b_2 = 89.981636, b_3 = 246.767258,
b_4 = -51.435321, c_1 = 28.798237, c_2 = -121.084942,
c_3 = 106.511481.
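
The solution step can be checked numerically. The sketch below is not the pivotal-condensation layout used in the text; it simply builds the ten normal equations from the subclass counts, replaces the redundant a_2, b_4 and c_3 equations by the three restrictions, and applies ordinary Gaussian elimination. The function names and the choice of which equations to replace are assumptions made for illustration only.

```python
# Subclass counts read from the coefficients of the normal equations.
N_AB = [[15, 8, 14, 11], [13, 12, 15, 8]]                # A level by B level
N_AC = [[15, 17, 16], [17, 17, 14]]                      # A level by C level
N_BC = [[9, 10, 9], [6, 8, 6], [9, 10, 10], [8, 6, 5]]   # B level by C level
# Right-hand sides: grand total, then the A, B and C marginal totals.
TOTALS = [428455, 220015, 208440, 117260, 90373, 136566, 84256,
          143259, 147657, 137539]


def normal_equations():
    """Coefficient matrix in the order m, a1, a2, b1..b4, c1..c3."""
    p, q, r = 2, 4, 3
    a0, b0, c0 = 1, 1 + p, 1 + p + q
    size = 1 + p + q + r
    mat = [[0.0] * size for _ in range(size)]

    def put(i, j, value):
        mat[i][j] = mat[j][i] = float(value)

    n_a = [sum(row) for row in N_AB]
    n_b = [sum(N_AB[i][j] for i in range(p)) for j in range(q)]
    n_c = [sum(N_AC[i][k] for i in range(p)) for k in range(r)]
    put(0, 0, sum(n_a))
    for start, counts in ((a0, n_a), (b0, n_b), (c0, n_c)):
        for t, count in enumerate(counts):
            put(0, start + t, count)
            put(start + t, start + t, count)
    for i in range(p):
        for j in range(q):
            put(a0 + i, b0 + j, N_AB[i][j])
        for k in range(r):
            put(a0 + i, c0 + k, N_AC[i][k])
    for j in range(q):
        for k in range(r):
            put(b0 + j, c0 + k, N_BC[j][k])
    return mat, (a0, b0, c0)


def solve(mat, rhs):
    """Gaussian elimination with partial pivoting on a square system."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(mat, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][c] * x[c] for c in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def solve_example():
    mat, (a0, b0, c0) = normal_equations()
    rhs = [float(t) for t in TOTALS]
    size = len(rhs)
    # Replace the redundant a2, b4 and c3 equations by the weighted
    # sum-to-zero restrictions (weights are the marginal counts).
    for start, stop in ((a0, b0), (b0, c0), (c0, size)):
        mat[stop - 1] = [mat[0][col] if start <= col < stop else 0.0
                         for col in range(size)]
        rhs[stop - 1] = 0.0
    x = solve(mat, rhs)
    return {"m": x[0], "a": x[a0:b0], "b": x[b0:c0], "c": x[c0:]}


estimates = solve_example()
print(estimates)
```

Any other choice of three dependent equations to replace would lead to the same estimates, since the restrictions alone fix the indeterminacy.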




And from (9) we see that the sum of squares
due to the estimates m, a's, b's and c's is


SS(m,a,b,c)
= (4463.072917)(428455) + (132.817457)(220015)
+ (-132.817457)(208440) + ... + (106.511481)(137539)
= 1,918,744,217.773958
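
As a check on the arithmetic, equation (9) is just the sum of products of each estimate with the total on the right-hand side of its normal equation. The following is a minimal sketch using the estimates and totals quoted in the text; the variable names are illustrative.

```python
# Estimates in the order m, a1, a2, b1..b4, c1..c3 (from the text).
estimates = [4463.072917, 132.817457, -132.817457,
             -284.950488, 89.981636, 246.767258, -51.435321,
             28.798237, -121.084942, 106.511481]
# Matching totals: grand total, then A, B and C marginal totals.
totals = [428455, 220015, 208440,
          117260, 90373, 136566, 84256,
          143259, 147657, 137539]


def reduction_ss(constants, sums):
    """Equation (9): sum of each fitted constant times its total."""
    return sum(c * s for c, s in zip(constants, sums))


ss_mabc = reduction_ss(estimates, totals)
print(round(ss_mabc, 2))
```

The result agrees with the printed value of SS(m,a,b,c) to within the rounding of the six-decimal estimates.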


Calculations were first made of the ‘‘within 
subclasses’’ sum of squares using (7) and of 
the ‘‘between subclasses’’ sum of squares using 
(8). They were 


SSE = 31,756,570.4334, SSB = 14,998,643.91,
SS(m) = 1,912,225,906.65.


Consequently, from (10)
SSH'_4 = 8,480,332.79


We should test H'_4 immediately by considering
F_4 = (d.f. E)(SSH'_4)/(d.f. H'_4)(SSE), which is
distributed as Snedecor's F with n_1 = d.f. H'_4
and n_2 = d.f. E. This test was made and the
hypothesis accepted in our example. Thus it was
admissible to proceed in testing H_1, H_2 and H_3.
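
The pooled-interaction test can be sketched as follows. The sums of squares are those quoted in the text; the degrees of freedom are assumptions inferred from the design (96 observations in 2 x 4 x 3 = 24 subclasses give 72 for error, and pqr - p - q - r + 2 = 17 for pooled interaction).

```python
# Quantities quoted in the text.
ssb = 14998643.91          # between-subclasses sum of squares
ss_m = 1912225906.65       # SS(m)
ss_mabc = 1918744217.77    # SS(m,a,b,c)
sse = 31756570.43          # within-subclasses (error) sum of squares

# Assumed degrees of freedom, inferred from the design.
p, q, r = 2, 4, 3
df_interaction = p * q * r - p - q - r + 2   # 17
df_error = 96 - p * q * r                    # 72

# Equation (10), then the F ratio for the pooled interaction.
ss_h4 = ssb - ss_mabc + ss_m
f4 = (df_error * ss_h4) / (df_interaction * sse)
print(round(ss_h4, 2), round(f4, 3))
```

The ratio is then referred to a table of F with 17 and 72 degrees of freedom.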

As previously mentioned, to test H_1 we set
a_1 = a_2 = 0 in the normal equations, delete the
a-equations, and solve for m, the b's, and the c's.
The solutions were made by sweeping out the b's
in the c-equations, solving for the c's, and then
for m and the b's. Actual calculation gave


m = 4463.072917, b_1 = -275.670486, b_2 = 63.149084,
b_3 = 241.663150, b_4 = -29.075902, c_1 = 18.763892,
c_2 = -119.945382, c_3 = 115.923281 and
SS(m,b,c) = (4463.072917)(428455) + (-275.670486)(117260)
+ ... + (115.923281)(137539) = 1,917,082,201.60.


Therefore, according to (11),
SSH_1 = SS(m,a,b,c) - SS(m,b,c)
      = 1,918,744,217.77 - 1,917,082,201.60
      = 1,662,016.17.


Similarly, in testing H_2 we set b_1 = b_2 = b_3 =
b_4 = 0 and solve for m, the a's and the c's.
They were


m = 4463.072917, a_1 = 118.646549,
a_2 = -118.646549, c_2 = -120.219977,
c_3 = 113.650725, and SS(m,a,c) = 1,914,514,446.37.


Therefore, according to (12)
SSH_2 = SS(m,a,b,c) - SS(m,a,c) = 4,229,771.40.


Also, to test H_3, we set c_1 = c_2 = c_3 = 0 and
solve for m, the a's and the b's. These were


m = 4463.072917, a_1 = 134.104864,
a_2 = -134.104864, b_1 = -284.794693,
b_2 = 82.398056, b_3 = 250.723802,
b_4 = -49.721053 and SS(m,a,b) = 1,917,880,753.97.


Therefore, according to (13)
SSH_3 = SS(m,a,b,c) - SS(m,a,b) = 863,471.26.


(Note: An illustration of the calculations for 
SSH1, SSH2, and SSH3 by means of (17), (18) 

and (19) will be available shortly in mimeographed 
form for interested readers. ) 


Finally, the analysis of variance is presented
in Table II.
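
The mean squares and F ratios behind the decisions can be recomputed from the sums of squares reported for the example. The sketch below assumes the degrees of freedom implied by the design (1, 3 and 2 for A, B and C, 17 for pooled interaction, 72 for error); the resulting ratios would then be compared with tabled values of F.

```python
# Sum of squares and assumed degrees of freedom for each source.
sources = {
    "A": (1662016.17, 1),
    "B": (4229771.40, 3),
    "C": (863471.26, 2),
    "Pooled interaction": (8480332.79, 17),
}
sse, df_error = 31756570.43, 72

ms_error = sse / df_error
f_ratios = {}
for name, (ss, df) in sources.items():
    mean_square = ss / df
    f_ratios[name] = mean_square / ms_error
    print(f"{name:20s} MS = {mean_square:14.2f}  F = {f_ratios[name]:.3f}")
```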

One remaining point should be noted. Since 
the effects of this design were non-orthogonal, 
additivity of the individual sums of squares to 
the total sum of squares was not present. 


Normal equations for the example, from (4):

m  : 96m + 48a_1 + 48a_2 + 28b_1 + 20b_2 + 29b_3 + 19b_4 + 32c_1 + 34c_2 + 30c_3 = 428455
a_1: 48m + 48a_1 + 15b_1 + 8b_2 + 14b_3 + 11b_4 + 15c_1 + 17c_2 + 16c_3 = 220015
a_2: 48m + 48a_2 + 13b_1 + 12b_2 + 15b_3 + 8b_4 + 17c_1 + 17c_2 + 14c_3 = 208440
b_1: 28m + 15a_1 + 13a_2 + 28b_1 + 9c_1 + 10c_2 + 9c_3 = 117260
b_2: 20m + 8a_1 + 12a_2 + 20b_2 + 6c_1 + 8c_2 + 6c_3 = 90373
b_3: 29m + 14a_1 + 15a_2 + 29b_3 + 9c_1 + 10c_2 + 10c_3 = 136566
b_4: 19m + 11a_1 + 8a_2 + 19b_4 + 8c_1 + 6c_2 + 5c_3 = 84256
c_1: 32m + 15a_1 + 17a_2 + 9b_1 + 6b_2 + 9b_3 + 8b_4 + 32c_1 = 143259
c_2: 34m + 17a_1 + 17a_2 + 10b_1 + 8b_2 + 10b_3 + 6b_4 + 34c_2 = 147657
c_3: 30m + 16a_1 + 14a_2 + 9b_1 + 6b_2 + 10b_3 + 5b_4 + 30c_3 = 137539

and restrictions from (6) are

48a_1 + 48a_2 = 0
28b_1 + 20b_2 + 29b_3 + 19b_4 = 0
32c_1 + 34c_2 + 30c_3 = 0


Summary


The least-squares analysis of the general
p x q x r factorial design with unequal subclass
frequencies has been demonstrated. Explanations
of certain interrelationships in the analysis
have been made. This method of analysis
has been applied to actual data. The amount
of labor connected with an analysis of this
sort is greatly reduced by using a systematic
computational layout, rather than by deriving
algebraic formulas.


References


1. ... (McGraw-Hill and Co., 1952), pp. 278-284.

2. Collier, Raymond O., Jr. "Some Applications
of the Method of Pivotal Condensation in
Statistical Analysis," Journal of Experimental
Education, XXI (1953), pp. 233-241.
ON THE PROBLEM OF SAMPLE SIZE FOR
MULTIVARIATE SIMPLE
RANDOM SAMPLING*


WILLIAM J. MOONAN 
University of Minnesota 


0. Introduction 


A RESEARCHER who is attempting to
set up a sampling design is faced with many
problems. Not the least of these is the problem
of deciding which of the variates shall be used
to determine the appropriate sample size. This
question is usually resolved either by


a. determining the sample size from the
variate which is considered most
"important," or


b. determining the sample size from each
of several of the more important variates
and then choosing the sample size
from among these numbers. The maximum
number is the one which is commonly
chosen.


There are certain disadvantages associated 
with either of the above procedures. First of 
all, it is often difficult to decide which variate 
is most ‘‘important’’ for purposes of the survey. 
In many surveys information on several variates 
is sought with possibly equal importance attached 
to each. For instance, a school survey may have 
as its objective the collection of information 
which relates to age, health, socio-economic 
status and learning ability. If this fact is rec- 
ognized and several sample sizes are determined 
we may find they are very discrepant, depending, 
in part of course, on the variance of the variates 
and the accuracy desired for estimating them. 
If the largest sample size is chosen, it is likely 
that some variates will be ‘‘over-estimated’’ 
with perhaps considerable trouble and expense. 
Furthermore, it must be remembered that the 
estimates are to be determined from the charac- 
teristics of the same randomly sampled individ- 
uals and, as a consequence, these values will 
not be independently estimated. It is the purpose
of this paper to show how to find the sample
size for surveys which sample several
characteristics of the same individuals.


1. The Univariate Case 





To determine sample size in surveys, there 
are two important circumstances to keep in 
mind. The first concerns the ‘‘accuracy’’ that 
the estimates must possess and the second is 
the probability that the accuracy will be obtained. 
By "accuracy" we mean the distance on the
variate scale between the estimated value of a
parameter and the parameter itself. Great
accuracy then signifies a small distance whereas
small accuracy refers to a large distance. It
would be of little value to set up a sampling
scheme which intends to provide great accuracy 
and then proceed with the execution of this 
scheme if there was little chance that such ac- 
curacy would be achieved. Alternatively, noth- 
ing is to be gained by sampling with high degree 
of assurance that very small accuracy will be 
obtained. It is possible to bring both accuracy 
and confidence into play by determining sample 
size for single variates by considering the square 
of Student’s t. It is known that t-square is dis- 
tributed as Snedecor’s F with 1 and n - 1 de- 
grees of freedom, i.e., 


(1) F(1, n-1; \alpha) = (\bar{x}^1 - \theta^1) S_{\bar{x}}^{11} (\bar{x}^1 - \theta^1)   (see footnote 1)


where \alpha is the confidence coefficient, \bar{x}^1 is the
sample mean of the variate x^1 based on a sample
of size n, \theta^1 is the population mean and S_{\bar{x}}^{11} is
the inverse of the variance of \bar{x}^1; therefore,
S_{\bar{x}}^{11} = nS^{11} where S^{11} is the inverse of the
sample variance S_{11}.

The value of \alpha is used to reflect the confidence
we desire to be associated with the accuracy,
d^1 = |\bar{x}^1 - \theta^1|. Since interest is attached
to the value of n, (1) is rewritten as





* This paper was suggested by some survey work at the Bureau of Institutional Research of the
University of Minnesota.


1. Since we shall distinguish different variates by the use of superscripts, powers of numbers will
not be used in this article.
(2) n = F(1, n-1; \alpha) / d^1 S^{11} d^1


There are some difficulties connected with
the solution for n. The accuracy, d^1, is some
value that we arbitrarily specify, but S_{11} must be
evaluated from a priori knowledge of the variate,
a pilot study or from some method such as dividing
the range of the variate by 5 or 6. Some error
is to be expected at this point but fortunately
it usually can be kept small. Additionally,
F(1, n-1; \alpha) depends upon the value of n which
is unknown and must be available in order to
determine the F value. There appears to be an
indeterminacy in solving for n, but if we are willing
to assume that the magnitude of F does not
change very much for reasonable values of n
which are encountered in practice, then the
problem can be resolved. It will be shown later
how this problem can be taken care of in a little
better manner.

To illustrate the calculations, suppose we
are interested in ascertaining the characteristics
of high school chemistry students for 1952
in a certain city. For the moment we shall be
interested in a measure of their mental ability.
During the academic year there were 505 students
who completed eleventh grade chemistry in
this city's schools, and as a part of the school
testing program they were given the Otis Quick
Scoring mental ability test. It is known that this
test has approximately a variance \sigma_{11} = 100 for
this universe. Let us be interested in estimating
the population mean within 2 points, i.e.,
d^1 = 2, with a confidence of .95. Substituting
in (2) with n = \infty we get,


(3) n = 3.84/[2(.01)2] = 96.


Next we find F(1, 96; .95) = 3.95 and substitute
this value for 3.84 in (3); then n = 99. This
iterative process can be repeated until stability
is reached. Ninety-nine is the stable value for
this problem. If we had started using F(1, 20;
.95) we would have ended with n = 99 also. Since
we are dealing with a finite population, the final
sample size is given by n_0 = nN/(N + n), where
N is the universe size. For this problem n_0 = 83.
With the aid of a roll of the students and a table
of random numbers it is a simple matter to obtain
the random 83 Otis scores and to evaluate
their mean. This was actually done and \bar{x}^1 =
42.6 whereas, for comparison purposes, \theta^1 =
43.2. Therefore the desired accuracy was
actually achieved as expected.
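
The iteration for the univariate illustration can be sketched in a few lines. The F values are the two quoted in the text rather than values computed from a distribution function, and rounding to the nearest integer is an assumption that happens to reproduce the printed figures.

```python
def n_univariate(f_value, d, variance):
    """Equation (2): n = F / (d * S^11 * d), with S^11 = 1 / variance."""
    return f_value / (d * (1.0 / variance) * d)


def finite_population(n, universe):
    """Finite-population adjustment: n_0 = nN / (N + n)."""
    return n * universe / (universe + n)


F_START = 3.84   # F(1, infinity; .95), the starting value in (3)
F_NEXT = 3.95    # F(1, 96; .95), as quoted in the text

n_first = round(n_univariate(F_START, d=2, variance=100))
n_stable = round(n_univariate(F_NEXT, d=2, variance=100))
n_final = round(finite_population(n_stable, universe=505))
print(n_first, n_stable, n_final)
```

In practice the F value would be looked up again at each new n until successive values of n agree.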


2. The Multivariate Case 





The unfamiliar notation adopted in section 1
was used in order to provide a smooth transition
from the univariate to the multivariate
cases. The generalization of Student's t was
made by Hotelling in 1931 (see footnote 2). This
statistic is called T, and T-square, i.e.,
d_i S_{\bar{x}}^{ij} d_j, is proportionately distributed
as a variance ratio distribution. In fact,


(4) F(p, n-p; \alpha) = (n-p) d_i S_{\bar{x}}^{ij} d_j / p(n-1),


where p is the number of variates; n is the sample
size; \alpha is the confidence coefficient;
d_i = (\bar{x}^1 - \theta^1, ..., \bar{x}^p - \theta^p) and d_j is the
transpose of this matrix; S_{\bar{x}}^{ij} is the inverse
matrix of the variance-covariance (varcovar) matrix
of the \bar{x}^i. Consequently, S_{\bar{x}}^{ij} = nS^{ij} where
S^{ij} is the inverse of the varcovar, S_{ij}, of the
variates.

It is possible to solve for n in (4); n is the
largest root of the following quadratic equation
in n:


(5) n [d_i S^{ij} d_j] n - np [d_i S^{ij} d_j + F(p, n-p; \alpha)]
    + pF(p, n-p; \alpha) = 0


If we are willing to replace nS^{ij} by (n-1)S^{ij} in
(4), then the value of n is


(6) n = p [1 + F(p, n-p; \alpha)/d_i S^{ij} d_j].


For practical applications the S_{ij} must be
determined by methods which were mentioned
before. Also, we notice that F depends on n.
Therefore, we must successively approximate
n in a manner similar to that shown in the last
illustration. The arithmetic procedures can be
shown if we consider that we are seeking knowledge
of two characteristics of the chemistry
students. Let the two variates be the mental
ability as measured by the Otis test and chemistry
ability as measured by a standardized
chemistry test.

Suppose we are interested in obtaining the
accuracies


(7) d_i = (\bar{x}^1 - \theta^1, \bar{x}^2 - \theta^2) = (2  3)


with a confidence of .95. Furthermore,


(8) S_{ij} = | S_11  S_12 | = | 100   90 |
             | S_21  S_22 |   |  90  200 |

    and

    S^{ij} = |  .016807  -.007563 |
             | -.007563   .008403 |


We notice that the correlation between the
two variates is taken to be .64, i.e.,
.64 = 90/\sqrt{100 x 200}.


2. Harold Hotelling, "The Generalization of Student's Ratio," Annals of Mathematical Statistics, II
(1931), pp. 360-378.


Since F(2, \infty; .95) = 2.99 and


(9) d_i S^{ij} d_j = (2  3) |  .016807  -.007563 | | 2 |
                           | -.007563   .008403 | | 3 |
                  = .052,


we evaluate (6) as n = 2 [1 + 2.99/.052] = 117.
Substituting F(2, 117; .95) = 3.07 for F(2, \infty;
.95) we get n = 120 and n_0 = 97. If the sample
size had been determined on each variate
individually and independently, then n_01 = 83 and
n_02 = 75. Notice that the sample size for
multivariate sampling is larger than the sample
size determined for the variates separately.
This is due to two factors. The first is that the
variates are correlated and the second is due to
the fact that the variates have not been evaluated
from independent samples.
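
The bivariate calculation can be reproduced with a short sketch. The F values (2.99 and 3.07) are taken from the text; the helper names and the nearest-integer rounding are assumptions for illustration. Because the quadratic form is carried at full precision here rather than rounded to .052, the intermediate values differ slightly from the hand calculation before rounding.

```python
def inverse_2x2(s):
    """Inverse of a 2 x 2 matrix given as nested lists."""
    (a, b), (c, d) = s
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def quadratic_form(d, s_inv):
    """d_i S^ij d_j for a vector d of length two."""
    return sum(d[i] * s_inv[i][j] * d[j] for i in range(2) for j in range(2))


def n_multivariate(p, f_value, q):
    """Equation (6): n = p [1 + F / (d_i S^ij d_j)]."""
    return p * (1 + f_value / q)


varcovar = [[100, 90], [90, 200]]
accuracies = [2, 3]

q = quadratic_form(accuracies, inverse_2x2(varcovar))
n_first = round(n_multivariate(2, 2.99, q))    # start from F(2, infinity; .95)
n_second = round(n_multivariate(2, 3.07, q))   # F(2, 117; .95) from the text
n_final = round(n_second * 505 / (505 + n_second))  # finite universe, N = 505
print(round(q, 4), n_first, n_second, n_final)
```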

Let us assume that the variates are independent,
i.e., their covariance is zero. Then,


(10) S_{ij} = | 100    0 |,   S^{ij} = | .010  .000 |
              |   0  200 |             | .000  .005 |


and d_i S^{ij} d_j = .085, so that n = 76 after the
iterations, and n_0 = 66.

Thus we have the two independent sample
size values 83 and 75, the joint and correlated 
value of 97, and the joint and uncorrelated val- 
ue of 66. These values may be interpreted in 
this manner. If we want to estimate the two var- 
iates with certain accuracies and confidence 
from the same sample of individuals, we must 
take into account the covariance of the variates 
as well as the fact that we do not have two inde- 
pendent samples. The equations tell us that the 
price paid for these circumstances is a sample 
size of 97. If we are willing to take two differ- 
ent samples, we can achieve the same desired 
accuracies and confidence with a sample size 
on one variate of 83 and a sample size on the in- 
dependent sample of another 75 individuals on 
the other variate. The advantage of using one 
sample at a saving of 75 + 83 - 97 = 61 individ- 
uals is clear. Had the variates actually been 
uncorrelated we would have saved 75 + 83 - 66 
= 92 individuals and still achieved the same ob- 
jectives. It might seem surprising that a smaller
sample is required for estimating both vari-
ates simultaneously when they are uncorrelated 
than is required for estimating them separately. 
Recall, however, in one case that the variates 
were assumed to be bivariate normal whereas 
in the other case we were concerned with two 
individual univariate distributions. 

We might be tempted to use 83 individuals 
and evaluate both sample means from these stu- 
dents. However, if the correlation is . 64, then 
we cannot be 95% sure of achieving the simul- 
taneous accuracies. If no correlation exists, 
then we are more than 95% sure of achieving
the desired accuracies. Thus our sampling 
scheme is costing more than necessary for the 
original specifications. 

There exists the reasonable notion that the
sample size should be invariant to the signs of
the covariances in the varcovar matrix. Direct
application of the above formulae will not result
in such an expectation when some of the covariances
are negative, as well they might be in
practical applications. To make the formulae
work, certain adjustments must be made in the
signs of the elements of d_i or S_{ij}. For instance,
if S_12 = S_21 = -90 in the bivariate example, then
the new d_i S^{ij} d_j is larger than the value given
and this results in a sample size of about 30. This
value is not realistic for this problem. The
trouble stems from the fact that we have
made the elements of d_i assume their absolute
values. This makes no difference when all the
covariances are positive. If S_12 and S_21 are
positive, we should expect the terms d^1 S_12 d^2
and d^2 S_21 d^1 to be positive because d^1 and d^2 are
positive together or negative together on the
average in random sampling. If S_12 and S_21 are
negative, we should expect the terms d^1 S_12 d^2
and d^2 S_21 d^1 to be positive since S_12 and S_21 are
negative and on the average either d^1 or d^2 is
negative. Therefore, cross-product terms are
subtracted out in the expansion of the quadratic
form d_i S^{ij} d_j. The sample sizes can be brought
into agreement if we let d_i = (2, -3) or (-2, 3)
while allowing S_12 = S_21 = -90. More simply we
can allow S_12 = S_21 = 90 whether or not they are
+90 or -90 and proceed with d_i = (2, 3).
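
The effect of the sign adjustment can be seen numerically. The sketch below evaluates the quadratic form for the bivariate example with the covariance taken as -90, first with both accuracies positive and then with one sign reversed; the helper functions are illustrative, and 2.99 is the F value used earlier in the text.

```python
def inverse_2x2(s):
    """Inverse of a 2 x 2 matrix given as nested lists."""
    (a, b), (c, d) = s
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def quadratic_form(d, s_inv):
    """d_i S^ij d_j for a vector d of length two."""
    return sum(d[i] * s_inv[i][j] * d[j] for i in range(2) for j in range(2))


positive = inverse_2x2([[100, 90], [90, 200]])
negative = inverse_2x2([[100, -90], [-90, 200]])

q_reference = quadratic_form([2, 3], positive)     # covariance +90
q_unadjusted = quadratic_form([2, 3], negative)    # covariance -90, d unchanged
q_adjusted = quadratic_form([2, -3], negative)     # covariance -90, sign of d^2 reversed

n_unadjusted = 2 * (1 + 2.99 / q_unadjusted)       # first pass of equation (6)
print(round(q_reference, 4), round(q_unadjusted, 4), round(q_adjusted, 4))
print(round(n_unadjusted, 1))
```

With the sign of one accuracy reversed, the quadratic form returns to its value for the positive-covariance case, so the same sample size results.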

In general, we can get the appropriate sample
size by allowing all the elements of S_{ij} to be
positive and also letting all elements of d_i be
positive. Otherwise, under certain conditions,
we can let elements of S_{ij} have their real signs
and let the elements of d_i take on the signs of
the elements of any row or column of S_{ij}. Since
the signs of S_11 and S_12 are both + and S_21 and
S_22 are both + when S_12 = S_21 = 90, we can let
d_i = (2, 3) under either procedure. If we allow
S_12 = S_21 = -90, S_11 is + and S_12 is -, and S_21
is - and S_22 is +, therefore we can use either
d_i = (2, -3) or d_i = (-2, 3).

In this preliminary work, two types of varcovar
have been found. These have been called
covariately consistent and covariately inconsistent.
A varcovar is said to be covariately consistent
if the signs of the elements in every row
or column, when multiplied either by +1 or -1,
are identically equal to the signs of any other
row or column. Zero elements are given an
arbitrary sign. Any varcovar for which the above
conditions cannot be fulfilled is said to be
covariately inconsistent. A varcovar of order p x p
has 2^{p-1} covariately consistent forms. Varcovars
of orders 1 x 1 and 2 x 2 are always covariately
consistent while only 4 forms of a 3 x 3
varcovar and 8 forms of a 4 x 4 varcovar are
consistent. Consistency in a 3 x 3 varcovar
would break down if S_12 < 0, S_13 > 0 and S_23 > 0.
Such an arrangement as this has been called
covariately inconsistent because it is difficult to
conceive of a self-contained real-variate system
which exhibits a true mutual relationship
such as this.
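
The definition of covariate consistency lends itself to a small brute-force check. The sketch below is one possible reading of the definition: a sign pattern is consistent when a +1 or -1 can be attached to each variate so that every non-zero covariance carries the product of the signs of its two variates. The function names are hypothetical.

```python
from itertools import product


def is_covariately_consistent(signs):
    """signs: square nested list of +1, -1 or 0 (0 = zero covariance)."""
    p = len(signs)
    for s in product((1, -1), repeat=p):
        if s[0] != 1:
            continue  # a global change of sign gives nothing new
        if all(signs[i][j] == 0 or signs[i][j] == s[i] * s[j]
               for i in range(p) for j in range(p) if i != j):
            return True
    return False


def count_consistent_forms(p):
    """Count consistent sign patterns among all off-diagonal sign choices."""
    pairs = [(i, j) for i in range(p) for j in range(i + 1, p)]
    count = 0
    for combo in product((1, -1), repeat=len(pairs)):
        signs = [[1] * p for _ in range(p)]
        for (i, j), sign in zip(pairs, combo):
            signs[i][j] = signs[j][i] = sign
        if is_covariately_consistent(signs):
            count += 1
    return count


# The inconsistent 3 x 3 arrangement: S_12 < 0, S_13 > 0, S_23 > 0.
example = [[1, -1, 1], [-1, 1, 1], [1, 1, 1]]
print(is_covariately_consistent(example))
print([count_consistent_forms(p) for p in (1, 2, 3, 4)])
```

The counts for orders 1 through 4 agree with the 2^{p-1} rule stated above.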

For a given 3 x 3 varcovar, the inverses of
the 4 consistent forms differ among themselves
only in the signs of the non-main diagonal terms.
This is also true for the inverses of the inconsistent
forms, but the numerical values of the
two sets of inverses are different. The rule
mentioned above about determining sample size
by allowing the elements of S_{ij} to have their real
signs and allowing the elements of d_i to take
signs identical to those of any row or column is
appropriate only for covariately consistent
varcovar. For inconsistent forms about the only
thing that can be suggested at this point is to
change all covariances to positive signs and let
all elements of d_i be positive. This rule is
suggested with some reluctance, but until some
theory develops it will have to suffice since
apparently inconsistent forms occur in practice
(see footnote 3).


3. Summary 


The use of repeated and independent tests of 
significance on different variates from the same 
sample is known to be an unwarranted and yet 
popular practice. In problems of estimation, 
estimates of parameters of different variates 
are often made from the same sample without 
regard to the dependence of these estimates. 
This paper shows how appropriate estimates of 
mean values of different variates can be made 
using the same sample observations with a simple 
random sample. An illustration was given for 
the bivariate case although the technique need not 
be restricted to the estimation of the means of 
two variates. The ideas contained in this paper 
can be extended to other more complicated sam- 
pling designs, but the formulae have not yet been 
developed. 





3. As an example of an inconsistent sample varcovar, see R. A. Fisher, Statistical Methods for
Research Workers, in the section where he discusses the multiple regression analysis of certain
weather data.











NOTE ON DISPERSION ANALYSIS 


CHESTER HARRIS 
University of Wisconsin 


ONE OF THE functions of dispersion analysis
is to give a canonical representation of
group means in one or more statistically significant
dimensions. In developing this representation
it is also possible to provide for the calculation
of the Mahalanobis' D^2, or generalized
distance measure, for each pair of groups, and
of the linear discriminant scores associated with 
each group. A number of persons have contrib- 
uted to the development of these procedures; the 
review by Hodges (1) and the paper by Tiedeman 
(4) give important facts about the history of this 
development. Uses of dispersion analysis in 
psychology have been illustrated recently in the 
Rao-Slater analysis of neurotic groups (3) and 
by Webster (5). The purpose of this note is to 
comment on the matrices used in dispersion an- 
alysis and to suggest a calculation routine that 
some workers may prefer. Since this routine 
can readily be inferred from Rao’s illustration 
(2; 367-70), it constitutes no new contribution to 
statistical theory.

Given measures on n variables for N persons
who are assigned to s mutually exclusive groups,
the matrices B and W may be formed. B is the
between groups product-sum matrix, with s-1
degrees of freedom, and W is the within groups
product-sum matrix with N-s df. It is this
latter df that appears in the equations below.
Let G designate the n-by-s matrix of group 
means. What is desired is to identify one or 
more rows of a matrix Y, where YG gives the 
canonical representation of the groups. In solv- 
ing the problem it is conventional to assume W 
non-singular, so that the determinantal equation: 


|B - \lambda W| = 0    (1)


may be rewritten as:

|(df)BW^{-1} - \lambda I| = 0    (2)


for the purpose of calculating the roots. The
desired rows of Y are determined by solving, for
each significant root, the homogeneous equation:


y[(df)BW^{-1} - \lambda I] = 0    (3)


and standardizing. The standardization process 
will be made explicit later. A test of signifi- 
cance is used to determine the number of signif- 
icant roots, and ordinarily only the rows of Y 
corresponding to the significant roots are calcu- 





lated. The Rao-Slater and Webster papers may 
be referred to for numerical examples; the Rao- 
Slater analysis also appears in Rao (2). 

For the purpose of this development, let us 
make some simplifying assumptions, namely, 
that B and W both are non-singular and that the 
complete Y is to be determined. Since every 
row of Y satisfies (3), it is necessarily true that: 


YB = D_\lambda YW,    (4)


where D_\lambda designates the diagonal matrix of roots
of (1) and is, by our assumption, non-singular.
Post-multiplying (4) by Y^T, the conventional
transpose of Y, gives:


YBY^T = D_\lambda YWY^T.    (5)


Now B is symmetric, as is W, and consequently
both sides of (5) must be symmetric. This implies
that D_\lambda and YWY^T, both of which are
non-singular, must be commutative. Within the
non-singular multiplicative group, scalars and
diagonals, and only scalars and diagonals, are
commutative with any diagonal. It therefore is both
necessary and sufficient, for (5) to hold, that
YWY^T be either a diagonal or a scalar matrix.
The standardization process provides the added
requirement that the elements of the principal
diagonal of (1/df)YWY^T each equal unity.
Therefore we may write:


(1/df)YWY^T = I    (6)

and show Y^T Y to be the inverse of (1/df)W, since:

(1/df)WY^T Y = I = (1/df)Y^T YW.    (7)

The problem of solving for Y is thus formulated
as a problem of choosing the proper factors of
the matrix (df)W^{-1}.


Rao's transformation (2) on the elements of
(1/df)W provides a matrix C, such that:


(1/df)CWC^T = I    (8)

C^T C = (df)W^{-1} = Y^T Y.    (9)

Therefore we may equate:

C^T Q^T = Y^T    (10)

with Q necessarily orthogonal, since:

C^T Q^T QC = Y^T Y = C^T C.    (11)


We have shown, then, that YT is merely an or- 
thogonal rotation of CT, and consequently, CG 
is some orthogonal rotation of the canonical rep- 
resentation, YG, where as before G is the ma- 
trix of means of the groups. Our first conclus- 
ion therefore is that for a description of the s 
groups in n space, CG and YG are equally
satisfactory. It is also true that the Mahalanobis'
D^2 values may be calculated directly from CG
by summing the squares of the differences for
each pair of columns of CG.
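
A small numerical sketch may make this concrete for two variables. Here a triangular factor obtained from a Cholesky-type reduction of (1/df)W stands in for Rao's transformation; that substitution, the matrix W, the df, and the group means in G are all hypothetical and chosen only for illustration. The D^2 obtained from the columns of CG is compared with the direct quadratic form in the inverse of (1/df)W.

```python
import math

W = [[400.0, 120.0], [120.0, 900.0]]       # hypothetical within product-sums
DF = 20                                    # hypothetical within df (N - s)
G = [[10.0, 12.0, 15.0],                   # hypothetical means, n = 2 variables
     [20.0, 18.0, 25.0]]                   # by s = 3 groups


def triangular_factor(w, df):
    """Lower-triangular C with (1/df) C W C^T = I (2 x 2 case only)."""
    a, b, c = w[0][0] / df, w[0][1] / df, w[1][1] / df
    l11 = math.sqrt(a)
    l21 = b / l11
    l22 = math.sqrt(c - l21 * l21)
    return [[1.0 / l11, 0.0], [-l21 / (l11 * l22), 1.0 / l22]]


def matmul(x, y):
    """Plain matrix product of nested lists."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


def d2_from_cg(cg, g, h):
    """Sum of squared differences between columns g and h of CG."""
    return sum((row[g] - row[h]) ** 2 for row in cg)


def d2_direct(w, df, means, g, h):
    """(difference)^T [(1/df)W]^{-1} (difference), written out for 2 x 2."""
    a, b, c = w[0][0] / df, w[0][1] / df, w[1][1] / df
    det = a * c - b * b
    d0 = means[0][g] - means[0][h]
    d1 = means[1][g] - means[1][h]
    return (c * d0 * d0 - 2 * b * d0 * d1 + a * d1 * d1) / det


C = triangular_factor(W, DF)
CG = matmul(C, G)
for g, h in ((0, 1), (0, 2), (1, 2)):
    print(g, h, round(d2_from_cg(CG, g, h), 4), round(d2_direct(W, DF, G, g, h), 4))
```

Any other factor C satisfying (8) would give the same distances, since it differs from this one only by an orthogonal rotation.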

What has been done is to develop an arbitrary
set of factors of (df)W^{-1} by using Rao's
transformation; it is now necessary to determine the
transformation from C^T to Y^T. Form the matrix
CBC^T; its trace is the sum of its roots and
may be tested for overall group differences as
an alternative to Wilks' \Lambda criterion (Rao, 2; 372).
Solve sequentially for the characteristic vectors
and roots of CBC^T. As each root is extracted,
the test for residual variation may be made in
order to determine the number of significant
dimensions. Associated with each significant root
of CBC^T is a normalized characteristic vector
that forms a column of Q^T; consequently, the
linear weights in the transformation of C^T to Y^T
are developed in this sequential process.
Ordinarily one would terminate the process when
the residual variation is no longer significant
and thus calculate only the rows of Y associated
with significant roots.

It can be shown that the roots of (df)BW^{-1} are
given by this procedure. Using (9),


(df)BW^{-1} = BC^T C.    (12)

Since Y satisfies (3) and is non-singular,

YBC^T CY^{-1} = D_\lambda    (13)


where D_\lambda is the roots of (1) multiplied by the
scalar (df). Then:


BC^T CY^{-1} = Y^{-1} D_\lambda    (14)


CBC^T (CY^{-1}) = (CY^{-1}) D_\lambda    (15)


which shows (CY^{-1}) to be columns of characteristic
vectors of CBC^T, whose roots are D_\lambda.

Since CBC^T is symmetric, (CY^{-1}) may be taken
as an orthogonal matrix, say, Q^T, and we have:


C^T Q^T = C^T CY^{-1} = Y^T YY^{-1} = Y^T,    (16)


as required.
The restriction that B be taken as non-singular
may now be removed. With W non-singular,
both C and Y are non-singular, and the complete
Y exists in the sense of (9). We, therefore,
permit D_\lambda in (13) to be singular, that is, to have
one or more zero roots as entries in the principal
diagonal. Equation (15) is still valid however,
since every symmetric matrix whether
singular or non-singular is orthogonally similar
to a diagonal matrix of its latent roots. The
evident modification is that those columns of (CY^{-1})
corresponding to zero roots would not be of
interest, and consequently not be calculated. As
equation (16) indicates, this does not interfere
with determining the desired rows of Y.

To remove the restriction that W be non-sing- 
ular is obviously much more difficult and prob- 
ably not very useful, since one ordinarily would 
want the within groups product-sum matrix to be 
of rank equal to the number of variables. How- 
ever, it seems likely that this restriction might 
be removed. Rao’s transformation is effective 
with singular matrices, yielding a singular C; 
then the problem becomes one of solving for the 
roots and characteristic vectors of the singular 
CBC^T as before. The algebra would have to
be modified to establish this, and it will not be 
attempted here. 


Summary


The procedure suggested by this analysis
might take these steps:


a) Compute B, W, and G from the given data.

b) Use Rao's transformation to form C, such
that (1/df)CWC^T = I.

c) Compute CG. This gives an orthogonal rotation
of the complete canonical representation
of the group means. Plots may be made
in order to assist in viewing the configuration
of means; Thurstone's extended-vector
representation is useful when n is large.
Compute the Mahalanobis' D^2 measures, if
desired, from the data of CG. Tests of
significance for distances between groups are
available and may be applied.

d) Compute CBC^T and test the trace for
significance as an alternative to Wilks' \Lambda
criterion.

e) Extract characteristic vectors and roots of
the symmetric matrix CBC^T so long as the
residual variation is significant. Use these
characteristic vectors with unit variance to
transform C to Y and consequently CG to the
canonical representation YG.

f) If the linear discriminant scores for the
groups are desired, the coefficients are given
by C^T CG, since C^T C = (df)W^{-1}.


The main advantage of this procedure is that it
gives the configuration of group means at a
relatively early stage and with relatively little
calculation; this may be an important consideration
in making explorations of new data.
CONSTANCY OF RORSCHACH COLOR
RESPONSES UNDER EDUCATIONAL
CONDITIONING


JANET E. BLECHNER 
Berkeley City Schools 
Berkeley, California 


I Introduction 


IN A PREVIOUS article describing an 
experiment on the constancy of the Rorschach 
movement responses, the writer outlined com- 
mon conceptions concerning the role of the 
Rorschach in clinical usage. The point was 
stressed that the personality descriptions result- 
ing from the administration of the test were 
treated as representing basic personality struc- 
ture of the subject upon which daily experimental 
phenomena made somewhat narrow ripples on a
deeper stream. Because of the ever-increasing 
use of the Rorschach device in this manner it 
becomes more than ever necessary to attempt 
to determine the intrinsic validity of the scoring 
factors upon which these personality descrip- 
tions are based. 


II The Problem 





The experiment herein described constituted 
an extension of the original test of constancy of 
the movement scores. The writer attempted to 
ascertain whether the color responses (FC + 
CF + C) would increase with educational
conditioning designed to influence such an increase.


III Method and Procedure 





Two groups of students in the beginning edu- 
cational psychology course at the University of 
California at Berkeley in the fall of 1951 were 
designated as subjects for the experiment. One 
class acted as the experimental group, another 
as the control group. Case numbers were as 
follows: 


Experimental group 83 
Control group 96 


Both classes took the group version of the 
Rorschach twice, with an interval of about a 
month ensuing between the two administrations. 
The test records utilized for the experimental 
group were those only of students whose atten- 
dance at the conditioning session was attested 
to by an attendance record. This record consisted
of a test blank for the Ishihara test of
color-blindness, with the record serving the 
dual purpose of eliminating from the study those
who were color-blind, as well as those who were
not present.

Administration of the test and method of recording
responses were performed according to
the technique outlined by Harrower-Erickson. 
The only variation of the method occurred in the 
administration of the repeat test, when the sub- 
jects were instructed neither to attempt to re- 
member specifically what their earlier responses 
had been nor to try deliberately to give responses 
different from the previous ones. They were en- 
couraged to try to respond to the slides as if they 
had never seen them before, recording what they 
now saw. 

Scoring, by the Klopfer method, was performed
by the examiner for the original test and for
the second by a clinical psychologist having no
interest in nor connection with the study whatsoever.
This scorer also made a spot-check perusal
of the examiner’s scoring, the aim being to
secure as high a degree of objectivity in scoring 
as possible. 

For the conditioning procedure two different 
devices were utilized. The first consisted of 
lecture material in which the instructor discussed 
visual color-perception theories and the signifi- 
cance of color associations in the psychology of 
perception. He mentioned also, for example, the 
psychological origins of acceptance of certain 
colors as cool and others as warm.

In addition to the lecture material, the instruc- 
tor presented a set of slides which had been con- 
structed by the writer. These were composed of 
brightly and variously colored bits of odd-shaped 
gelatin papers, some of them shaped in a way that
would suggest an association with the color. A 
green bit of paper, for example, might be shaped 
like grass, with the objective that the student 
should be led to the association, ‘‘green grass.’’ 
The slides were explained as an attempt at con- 
struction of a new test, and the students were 
urged to assist the project by listing as many of 
the paired form and color associations as possible. 
Actually, it was believed that this constituted prac- 
tice in responding to color, and the question was 
whether or not the conditioning would result in 
heightened awareness of the color on the second 
Rorschach test. For the control group, of course, 
there was no conditioning activity of any kind. In 
both groups there was, as well, no mention of the 
Rorschach test except during the test administration itself.


TABLE I

MEANS AND STANDARD DEVIATIONS OF NUMBERS OF
RESPONSES ON TEST AND RE-TEST USING THE
GROUP RORSCHACH

(Data from Two Groups of College Students)

                          Group I      Group II
                          N = 83       N = 96

Means
   Test I
   Test II
   Gain
Standard Deviations
   Test I
   Test II


TABLE II

RESPONSES IN THE COLOR CATEGORY UPON TEST
AND RE-TEST, EXPRESSED AS ACTUAL NUMBER
OF RESPONSES

(Data from Two Groups of College Students)

                          Group I      Group II
                          N = 83       N = 96

Means
   Test I
   Test II
   Gain
Standard Deviations
   Test I
   Test II
Critical Ratio


IV Results 


In both groups, there was an expected in- 
crease in the total number of responses to the 
second test, but the increase tended to consist
largely of form-dominated responses to detail
areas, a phenomenon noted also by Harrower-
Erickson in her original standardization of the
group method of administration.

The number of color responses increased in
the second test by a mean of 1.95 responses for
the experimental group and 1.11 for the control
group.

Despite the apparently greater increase in
the color-conditioned experimental group, however,
the critical ratios of the difference in the
means point to the significance of the increase
for both groups. This evidence suggests the conclusion
that an increase in color responses is to
be found upon repeat administration of the test, 
but the evidence does not point conclusively to 
any accomplishment of this result by virtue of 
the conditioning. Were this to be assumed, it 
would have been necessary to discover a signifi- 
cant increase in the conditioned group but none 
in the control group. It may be theorized, then, 
that upon repetition of the test there is a relaxation
by the subjects of their intellectual control
and greater release of the affectivity which the 
color responses are presumed to indicate, or that 
familiarity with the test leads to greater response 
to the color. It must be assumed, therefore, 
that the color factor on the Rorschach does not 
refer to a stable personality element but to one 
which can be affected in as yet unknown ways. 
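
The article does not state how its critical ratios were computed. As an illustration only, the sketch below assumes the common form for test and re-test scores of the same subjects: the mean gain divided by the standard error of the mean gain. The function name and the scores are hypothetical and are not data from the study.

```python
from math import sqrt

def critical_ratio(test_1, test_2):
    """Mean gain divided by its standard error (paired scores assumed)."""
    gains = [second - first for first, second in zip(test_1, test_2)]
    n = len(gains)
    mean_gain = sum(gains) / n
    # Sample standard deviation of the individual gains.
    sd_gain = sqrt(sum((g - mean_gain) ** 2 for g in gains) / (n - 1))
    return mean_gain / (sd_gain / sqrt(n))

# Hypothetical color-response counts for eight subjects.
first_test = [3, 5, 2, 4, 6, 3, 4, 5]
second_test = [5, 6, 4, 5, 8, 4, 6, 6]
cr = critical_ratio(first_test, second_test)
```

A ratio computed in this way would be obtained separately for each group. The argument of the text is that a large ratio in the control group as well as in the experimental group leaves nothing that can be credited to the conditioning.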


V Summary and Conclusions 





Two classes of beginning educational psychol- 
ogy students at the University of California were 
utilized as experimental and control groups for 
an experiment to test the influence of educational
conditioning on the color responses to the Rorschach
test. Both groups took the Rorschach
twice at an interval of about one month. Prior 
to the repeat test in the experimental group the 
instructor presented lecture material, the Ishi- 
hara test of color-blindness, and examiner- 
constructed slides designed to increase the sub- 
jects’ experience at formulation of color-domin- 
ated concepts. Results showed a significant in- 
crease in number of color responses in both 
groups, but the increase could not be ascribed 
to the conditioning, since it appeared in both 
groups. Intrinsic validity of the color factor on 
the Rorschach, therefore, is open to question. 











“PSYCHOLOGICAL” CORRECTION 
FOR CHANCE 


JULIAN C. STANLEY * 
University of Wisconsin 


CORRECTING TEST scores for ‘‘guess- 
ing’’ is tantamount to correcting them for differ- 
ing numbers of unanswered items among the 
testees. If $\sigma_O$, the standard deviation of the
number of items omitted, is zero, then each
person’s z-score is unaffected by the usual
correction for chance success. Therefore, if each
testee leaves blank the same number of items
as every other testee, the correlation $r_{RS}$
between ‘‘rights’’ scores (R) and corrected
scores

$$S = R - \frac{W}{c-1}$$

will be +1.00. Here W is the individual’s
‘‘wrongs’’ score and c the number of choices
(options) each item has.

That the correction for chance is not needed 
when all testees answer all items was shown in 
1924 by Holzinger (5) and later re-stated in var- 
ious places (e.g., 2, p.271; 1; 3, p. 70; 6, p. 
164). Perhaps the most adequate brief discussion
is by Gulliksen (4, pp. 245-251), who states
that:


.... there is no reason for considering
any of these [corrections for chance]
formulas if, for most of the people, R
+ W is essentially equal to the total
number of items in the test. Such formulas
are to be used if, and only if, the
number of unmarked items .... is fairly
large for some persons, and fairly
small for others. (p. 248)


In the following paragraphs the writer presents
a proof that when $\sigma_O = 0$, each testee’s
z-score is unchanged by correction for
chance; in other words, for this special situation
$z_{(R - \frac{W}{c-1})} = z_R$. He also suggests
that even when a correction for chance is useless
from the statistical standpoint, it may
nevertheless be warranted by a certain attitude-effect
produced on college students, once the logic
of the correction formula is understood by
them.


The formula $S = R - \frac{W}{c-1}$ may be written

$$S = \left(\frac{c}{c-1}\right)R - \frac{n-O}{c-1},$$

where n is the total
number of items on the test (n = R + W + O for
each testee). If O is the same for each examinee,
S is a linear function of R, so $r_{RS} = 1$.
Thus when $\sigma_O$ is approximately zero, it appears
to be a waste of time to correct raw scores for
chance, since the correction will not alter any
standard scores.
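
The rewriting of the formula can be checked numerically. The following is a minimal Python sketch, not part of the original article; the function names and the scores are hypothetical, and every testee is assumed to omit the same number of items.

```python
from math import sqrt

def corrected_score(rights, wrongs, c):
    """Usual correction for chance: S = R - W / (c - 1)."""
    return rights - wrongs / (c - 1)

def pearson_r(xs, ys):
    """Product-moment correlation of two equally long lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

n, c, omits = 50, 4, 3                 # hypothetical test: 50 four-option items
rights = [20, 27, 31, 35, 42]          # hypothetical "rights" scores
wrongs = [n - omits - r for r in rights]

s_direct = [corrected_score(r, w, c) for r, w in zip(rights, wrongs)]
# Linear form from the text: S = (c / (c - 1)) R - (n - O) / (c - 1)
s_linear = [c / (c - 1) * r - (n - omits) / (c - 1) for r in rights]
r_rs = pearson_r(rights, s_direct)
```

With constant omits the two versions of S agree score for score, and the correlation between R and S is 1 apart from rounding error.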

This can be proved rigorously by starting
with the basic formula for a z-score, making
the correction for chance in the general case
where $\sigma_O \neq 0$, simplifying the ensuing formula,
and then showing that when O does not vary from
testee to testee the formula reduces to the original
z uncorrected for chance. Begin with the
usual formula,

$$z = \frac{R - \bar{R}}{\sigma_R},$$

where $\bar{R}$ is the mean ‘‘rights’’ score. Then for
any one of N testees,

$$z' = \frac{\left(R - \frac{W}{c-1}\right) - \frac{1}{N}\sum_{1}^{N}\left(R - \frac{W}{c-1}\right)}{\sigma_{\left(R - \frac{W}{c-1}\right)}} \qquad (1)$$


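It can be shown rather easily that the mean
corrected for chance,

$$\frac{1}{N}\sum_{1}^{N}\left(R - \frac{W}{c-1}\right), \quad \text{equals} \quad \left(\frac{c}{c-1}\right)\bar{R} - \frac{1}{c-1}\left(n - \bar{O}\right) \quad \text{or} \quad \frac{c\bar{R} + \bar{O} - n}{c-1}.$$

Also, the standard deviation
corrected for chance, $\sigma_{(R - \frac{W}{c-1})}$, equals

$$\frac{1}{c-1}\sqrt{c^{2}\sigma_R^{2} + \sigma_O^{2} + 2c\,r_{RO}\,\sigma_R\sigma_O}.$$

Substituting
these values for the corrected mean and corrected
standard deviation in formula (1), we
obtain after further simplifications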


$$z' = \frac{c(R - \bar{R}) + (O - \bar{O})}{\sqrt{c^{2}\sigma_R^{2} + \sigma_O^{2} + 2c\,r_{RO}\,\sigma_R\sigma_O}} \qquad (2)$$
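
The intermediate steps may be spelled out as follows. This is a reconstruction from the definitions given in the text, not the author's own working; S denotes the corrected score and a bar denotes a mean over the N testees.

```latex
\begin{align*}
S &= R - \frac{W}{c-1} = \frac{cR + O - n}{c-1}
   && \text{since } W = n - R - O, \\
\bar{S} &= \frac{c\bar{R} + \bar{O} - n}{c-1}, \\
S - \bar{S} &= \frac{c(R - \bar{R}) + (O - \bar{O})}{c-1}, \\
\sigma_S^{2} &= \frac{c^{2}\sigma_R^{2} + \sigma_O^{2}
                + 2c\,r_{RO}\,\sigma_R\sigma_O}{(c-1)^{2}}, \\
z' = \frac{S - \bar{S}}{\sigma_S}
   &= \frac{c(R - \bar{R}) + (O - \bar{O})}
           {\sqrt{c^{2}\sigma_R^{2} + \sigma_O^{2}
                  + 2c\,r_{RO}\,\sigma_R\sigma_O}}.
\end{align*}
```

The factor 1/(c - 1) appears in both the deviation and the standard deviation and therefore cancels in the ratio.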











*The author is indebted to William J. McDrath for bibliographic assistance.
If all O’s are zero or, more generally, if
$\sigma_O = 0$, formula (2) becomes

$$z' = \frac{c(R - \bar{R})}{\sqrt{c^{2}\sigma_R^{2}}} = \frac{R - \bar{R}}{\sigma_R} = z.$$
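
The result can be illustrated with a small numerical check. The sketch below is an illustration rather than a proof; the helper names and the scores are hypothetical. It compares z-scores of the rights scores with z-scores of the corrected scores, once with equal omits and once with unequal omits.

```python
from math import sqrt

def z_scores(values):
    """Standard scores using the population standard deviation."""
    mean = sum(values) / len(values)
    sd = sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / sd for v in values]

def corrected(rights, omits, n, c):
    """S = R - W / (c - 1), with W = n - R - O for each testee."""
    return [r - (n - r - o) / (c - 1) for r, o in zip(rights, omits)]

n, c = 40, 5                          # hypothetical test: 40 five-option items
rights = [12, 18, 25, 30, 36]         # hypothetical "rights" scores
same_omits = [2, 2, 2, 2, 2]          # sigma_O = 0
varied_omits = [0, 10, 2, 5, 1]       # sigma_O > 0

z_r = z_scores(rights)
z_same = z_scores(corrected(rights, same_omits, n, c))
z_varied = z_scores(corrected(rights, varied_omits, n, c))

max_shift_same = max(abs(a - b) for a, b in zip(z_r, z_same))
max_shift_varied = max(abs(a - b) for a, b in zip(z_r, z_varied))
```

With equal omits the standard scores are unchanged apart from rounding error; with unequal omits they shift, which is the case in which the correction has a measurable effect.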





While there is no psychometric necessity for 
the chance correction when the number of omits 
varies little from testee to testee, there may 
be an excellent psychological reason. This is 
best illustrated by the true-false or two-option 
multiple choice test, where c = 2. Those exam- 
inees who are totally ignorant concerning the 
test material and who make sheer guesses on 
all n items will tend to ‘‘earn’’ scores of n/2 
unless a correction for guessing is utilized. The 
average rights score of these uninformed testees
will be about 50% of the total possible score, 
though their actual knowledge is, by definition, 
0%. When omits are negligible, correcting all
scores for chance raises their standard deviation
from

$$\sigma_R \quad \text{to} \quad \left(\frac{c}{c-1}\right)\sigma_R.$$

Thus the increase ranges from 100% for two-option
items to 25% for those with five options.
Also, the mean drops from $\bar{R}$ to $(c\bar{R} - n)/(c - 1)$.
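
These figures follow directly from the factor c/(c - 1). The short Python sketch below tabulates them for two to five options, assuming no omits; the function names and the test length of 100 items are hypothetical.

```python
def percent_sd_increase(c):
    """Percentage rise in the standard deviation after correction."""
    return (c / (c - 1) - 1) * 100

def corrected_mean(mean_rights, n, c):
    """Mean corrected score when nobody omits: (c * mean - n) / (c - 1)."""
    return (c * mean_rights - n) / (c - 1)

n = 100                                # hypothetical test length
increase = {c: percent_sd_increase(c) for c in (2, 3, 4, 5)}
# A group that guesses blindly averages n / c rights; after correction
# its mean falls to zero whatever the number of options.
guesser_mean = {c: corrected_mean(n / c, n, c) for c in (2, 3, 4, 5)}
```

The second table makes the psychological point concrete: the uncorrected average of n/c for sheer guessers becomes a corrected average of zero, in agreement with their actual knowledge.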
The writer has found that some college stud- 
ents tend to be resentful of low grades based up- 
on rights scores uncorrected for chance, even 
though their ranks in class are very low, but 
that when the chance correction is employed 
such complaints occur much less frequently. 
Therefore, correcting for chance may be im- 
portant as a grading factor, even when from a 
strict measurement standpoint it is useless. If
there are very few omits, a relatively simple
correction formula is

$$S = \left(R - \frac{n}{c}\right)\frac{c}{c-1}.$$

If R = n, then S = n. If R = n - 1, then
$S = n - \frac{c}{c-1}$. If R = n - 2, then

$$S = n - \frac{2c}{c-1}.$$

This constant subtractive factor of c/(c - 1) greatly
simplifies the process of finding corrected
scores, especially if an electric calculating machine
is available.
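
The successive-subtraction procedure can be sketched in Python. The function names are hypothetical, and both functions assume that no items are omitted, so that W = n - R.

```python
def corrected_no_omits(rights, n, c):
    """Simplified formula: S = (R - n/c) * c / (c - 1)."""
    return (rights - n / c) * c / (c - 1)

def corrected_by_subtraction(rights, n, c):
    """Start from n and subtract c / (c - 1) for each wrong answer."""
    return n - (n - rights) * c / (c - 1)

n, c = 20, 4                           # hypothetical test: 20 four-option items
table = {r: corrected_by_subtraction(r, n, c) for r in range(n, -1, -1)}
```

Here `table` maps each possible rights score to its corrected score, so that a grader would need to compute the subtractive constant only once.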


ERRATA


The author, Clifford M. Christensen, wishes to correct three errors that appeared 
in his article, ‘‘Multivariate Statistical Analysis of Differences Between Pre-pro- 
fessional Groups of College Students, ’’ in the March 1953 issue of the Journal of 
Experimental Education, pp. 221-232: 





1. p. 223, formula (3) should read:
$\chi^{2} = -\{n - \tfrac{1}{2}(p + q + 1)\}\log_e V$

2. p. 229, formula (8) should read:
$L_i = l_1X_1 + l_2X_2 + l_3X_3 + l_4X_4 - \tfrac{1}{2}\sum_{1}^{4} l_jX_{ij} + \log_e \pi_i$
where $l_j = \{a_{ih}\}\{X_{ij}\}$

3. p. 230, constant terms in Table X should read:
$-\tfrac{1}{2}\sum_{1}^{4} l_jX_{ij} + \log_e \pi_i$


In the December, 1953, Journal of Experimental Education, on page 150 of the 
article ‘‘A Simple Course Evaluation Scale, ’’ by Ralph Mason Dreger, the line— 
*Both ratios are significant at the 1% level—should have been included under 


Table I. 